MongoDB Map Reduce
Map-Reduce is a computational model that, in simple terms, breaks down a large amount of work (data) into smaller tasks (MAP) and then combines the results into a final outcome (REDUCE).
MongoDB's Map-Reduce runs user-supplied JavaScript map and reduce functions on the server, which makes it flexible and practical for large-scale data analysis.
MapReduce Command
The following is the basic syntax for MapReduce:
>db.collection.mapReduce(
function() {emit(key,value);}, //map function
function(key,values) {return reduceFunction}, //reduce function
{
out: collection,
query: document,
sort: document,
limit: number
}
)
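To use MapReduce, you need to implement two functions: the Map function and the Reduce function. MongoDB calls the Map function once for each document in the collection that matches the query, with the document bound to this. MongoDB then groups the emitted values by key and passes each key, together with its array of values, to the Reduce function for processing.
The Map function returns data only by calling emit(key, value); it may call emit any number of times for a document, including zero.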
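The flow described above can be sketched in plain JavaScript, outside MongoDB. This is only an illustration of the model: runMapReduce is a hypothetical helper, not part of the MongoDB API, and it passes the document and emit as arguments, whereas MongoDB binds the document to this and provides emit as a global. The real server may also group and reduce values in several passes.

```javascript
// Hypothetical in-memory illustration of the map-reduce flow; not MongoDB's implementation.
function runMapReduce(docs, map, reduce) {
  const groups = new Map(); // key -> array of emitted values

  // emit() collects each key-value pair produced by the map function.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };

  // Map phase: call map once per document.
  for (const doc of docs) map(doc, emit);

  // Reduce phase: collapse each key's values into a single value.
  // A key with only one value is passed through without calling reduce.
  const results = [];
  for (const [key, values] of groups) {
    const value = values.length > 1 ? reduce(key, values) : values[0];
    results.push({ _id: key, value });
  }
  return results;
}

// Usage: count how often each word occurs.
const words = [{ word: "a" }, { word: "b" }, { word: "a" }];
const wordCounts = runMapReduce(
  words,
  (doc, emit) => emit(doc.word, 1),
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);
console.log(wordCounts);
```

The result has the same shape as a MapReduce output collection: one document per key, with the key in _id and the reduced result in value.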
Parameter Explanation:
map: Mapping function (generates a sequence of key-value pairs, used as parameters for the reduce function).
reduce: Aggregation function. It receives a key and the array of values emitted for that key and returns a single value, i.e., it turns (key, values) into (key, value).
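out: Collection where the results are stored. Very old MongoDB releases fell back to a temporary collection when out was omitted, which was deleted automatically after the client disconnected; later releases expect out to be given explicitly, either as a collection name or as { inline: 1 } to return the results directly.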
query: A filtering condition. Only documents that meet the condition are passed to the map function. (query, limit, and sort can be combined arbitrarily.)
sort: Sort order applied to the input documents before they are sent to the map function. It is usually combined with limit, and sorting by the emit key can reduce the number of reduce operations.
limit: The maximum number of documents sent to the map function (sort on its own, without limit, has little effect on the result).
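To make the order of operations concrete, here is a minimal sketch, in plain JavaScript, of how query, sort, and limit narrow the input before any document reaches the map function. selectInput is a hypothetical helper, not part of the MongoDB API, and it simplifies query to exact equality on each listed field.

```javascript
// Hypothetical helper: mimics how query, sort, and limit narrow the input
// before the map function runs. Not part of the MongoDB API.
function selectInput(docs, { query = {}, sort = {}, limit = 0 } = {}) {
  // query: simplified to exact equality on every listed field.
  let selected = docs.filter((doc) =>
    Object.entries(query).every(([field, expected]) => doc[field] === expected)
  );

  // sort: 1 for ascending, -1 for descending, in the order the fields are listed.
  const sortFields = Object.entries(sort);
  if (sortFields.length > 0) {
    selected = [...selected].sort((a, b) => {
      for (const [field, direction] of sortFields) {
        if (a[field] < b[field]) return -direction;
        if (a[field] > b[field]) return direction;
      }
      return 0;
    });
  }

  // limit: keep at most `limit` documents; 0 means "no limit" in this sketch.
  return limit > 0 ? selected.slice(0, limit) : selected;
}

const sample = [
  { user_name: "tutorialpro", status: "active" },
  { user_name: "mark", status: "disabled" },
  { user_name: "mark", status: "active" },
];
const input = selectInput(sample, {
  query: { status: "active" },
  sort: { user_name: 1 },
  limit: 1,
});
console.log(input);
```

With limit: 1 the sort decides which matching document is kept, which is why sort and limit are usually used together.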
A typical use is to select the documents with status: "A" in an orders collection, group them by cust_id, and calculate the total amount for each customer.
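The orders example could be written roughly as follows. This is a sketch under assumptions: the text names only status and cust_id, so the amount field and the order_totals output collection are hypothetical names chosen for illustration. The call is guarded so that the functions can also be loaded outside a mongo shell.

```javascript
// Map: emit one (cust_id, amount) pair per order. `amount` is an assumed field name.
var mapOrders = function () {
  emit(this.cust_id, this.amount);
};

// Reduce: add up all amounts emitted for one cust_id.
var reduceOrders = function (key, values) {
  return values.reduce(function (total, amount) {
    return total + amount;
  }, 0);
};

// Only run against a database when a mongo shell `db` object is available.
if (typeof db !== "undefined") {
  db.orders.mapReduce(mapOrders, reduceOrders, {
    query: { status: "A" }, // only orders with status "A" reach the map function
    out: "order_totals"     // hypothetical name for the output collection
  });
}
```

Each document in the output collection would then hold a cust_id in _id and the summed amount in value.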
Using MapReduce
Consider the following document structure storing user posts, with documents containing the user's user_name and the post's status field:
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "mark",
"status":"active"
})
WriteResult({ "nInserted" : 1 })
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "mark",
"status":"active"
})
WriteResult({ "nInserted" : 1 })
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "mark",
"status":"active"
})
WriteResult({ "nInserted" : 1 })
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "mark",
"status":"active"
})
WriteResult({ "nInserted" : 1 })
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "mark",
"status":"disabled"
})
WriteResult({ "nInserted" : 1 })
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "tutorialpro",
"status":"disabled"
})
WriteResult({ "nInserted" : 1 })
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "tutorialpro",
"status":"disabled"
})
WriteResult({ "nInserted" : 1 })
>db.posts.insert({
"post_text": "tutorialpro.org, the most comprehensive technical documentation.",
"user_name": "tutorialpro",
"status": "active"
})
WriteResult({ "nInserted" : 1 })
Now, we will call mapReduce on the posts collection to select published articles (status: "active"), group them by user_name, and count the number of articles per user:
>db.posts.mapReduce(
function() { emit(this.user_name, 1); },
function(key, values) { return Array.sum(values); },
{
query: { status: "active" },
out: "post_total"
}
)
The above mapReduce output is:
{
"result": "post_total",
"timeMillis": 23,
"counts": {
"input": 5,
"emit": 5,
"reduce": 1,
"output": 2
},
"ok": 1
}
The results indicate that 5 documents met the query condition (status: "active"), the map function emitted 5 key-value pairs, and the emitted pairs were grouped by key into 2 output documents. The reduce function ran only once, because "mark" was the only key with more than one value.
Specific parameter explanations:
result: The name of the collection where the results are stored. Because out names a collection here (post_total), the collection is permanent and remains available after the connection is closed.
timeMillis: The time taken to execute, in milliseconds.
input: The number of documents that met the query condition and were sent to the map function.
emit: The number of times emit was called in the map function, i.e., the total number of key-value pairs emitted.
reduce: The number of times the reduce function was called.
output: The number of documents in the result collection.
ok: Whether it was successful, 1 for success.
err: If the operation failed, this may contain the reason for the failure, although the message is often vague and not very helpful.
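As a cross-check of the counts above, the following sketch replays the example in plain JavaScript, without MongoDB, and tallies the same four counters. It assumes, as in the single-pass case, that reduce is called at most once per key; a real server may call reduce more than once for the same key on larger inputs.

```javascript
// The eight sample posts, reduced to the two fields the example uses.
const posts = [
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "active" },
  { user_name: "mark", status: "disabled" },
  { user_name: "tutorialpro", status: "disabled" },
  { user_name: "tutorialpro", status: "disabled" },
  { user_name: "tutorialpro", status: "active" },
];

const counts = { input: 0, emit: 0, reduce: 0, output: 0 };
const grouped = new Map(); // user_name -> array of emitted values

for (const post of posts) {
  if (post.status !== "active") continue; // query: { status: "active" }
  counts.input += 1;
  // map: emit(this.user_name, 1)
  counts.emit += 1;
  if (!grouped.has(post.user_name)) grouped.set(post.user_name, []);
  grouped.get(post.user_name).push(1);
}

const postTotal = [];
for (const [userName, values] of grouped) {
  let value = values[0];
  if (values.length > 1) {
    // reduce runs only when a key has more than one value
    counts.reduce += 1;
    value = values.reduce((sum, v) => sum + v, 0);
  }
  postTotal.push({ _id: userName, value });
  counts.output += 1;
}

console.log(counts);    // { input: 5, emit: 5, reduce: 1, output: 2 }
console.log(postTotal); // mark -> 4, tutorialpro -> 1
```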
To view the mapReduce results, query the output collection with find:
> var map = function() { emit(this.user_name, 1); }
> var reduce = function(key, values) { return Array.sum(values); }
> var options = { query: { status: "active" }, out: "post_total" }
> db.posts.mapReduce(map, reduce, options)
{ "result": "post_total", "ok": 1 }
> db.post_total.find();
The above query displays the following results:
{ "_id": "mark", "value": 4 }
{ "_id": "tutorialpro", "value": 1 }
In a similar manner, MapReduce can be used to build large and complex aggregation queries.
The Map and Reduce functions are written in JavaScript, which makes MapReduce very flexible and powerful.