Forward Advance Learning

Introducing Map Reduce

Map reduce is an algorithm that allows you to distribute a merge operation across a large number of machines.

If you have data distributed across a cluster, each machine in the cluster can map a few documents feeding the result forward. Furthermore, each machine can handle a few reduce operations, potentially distributing the workload very widely.

Prior to Mongo 2.6, Map reduce was the only way to run queries on Mongo. The aggregate framework has largely replaced Map reduce, but we can still use it for complex queries.

Speed

Because Map reduce runs JavaScript functions, it can be slower than the equivalent aggregate pipeline. Favour aggregate operations where possible.

Map

The Map stage accepts a single document and converts it to a usable form. Say you are summing all the fleas on a herd of elephants, the map stage might output just the fleacount for a single elephant. It emits the value. It can emit many values.

var map = function() {
emit('flea_count', this.flea_count)
emit('tick_count', this.tick_count)
}

Reduce

The reduce stage accepts multiple mapped inputs and aggregates them in some way, perhaps adding them to an array, or summing them.

var reduce = function(id, counts) {
return Array.sum(counts)
}

Finally we issue the query:

db.people.mapReduce(
map,
reduce,
{
out: { inline: 1 }
}
)

read more about map reduce here: http://docs.mongodb.org/manual/reference/method/db.collection.mapReduce/

When to use this?

The Aggregate Framework uses Map Reduce behind the scenes, so for most cases you don't need to interact with it directly. However, there are cases when you need to go beyond the built in tools and do something more complex.

For example, I had data from a competition. People could enter multiple times. Sometimes they had filled in all the details, sometimes some, sometimes none. I wanted to combine these results using email as an id, and I needed some complex logic to create a single valid record.

Exercise - Map Reduce

  • Use Map reduce to take the people data and count the total age of everyone.
  • Count the total age of all the cats.