Introduction to MongoDb with .NET part 23: first aggregation examples
May 5, 2016 Leave a comment
Introduction
In the previous post we introduced the topic of aggregations in general. Aggregations in MongoDb are similar in purpose to the GROUP BY clause in SQL. They help us analyse the data stored in the database so that understand things like long term business trends, user and customer behaviour, which part of our business is making losses and similar. Aggregation queries are added to the aggregation pipeline using $ operators and the documents are passed from one stage to the next in the pipeline. We will get the aggregation result at the end of the pipeline.
In this post we’ll look at our first aggregation example.
Find the total population by state
Just as a reminder here’s one document from the zipcodes demo collection:
{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" }
In this first exercise we want to group the documents by the state and sum up the population in each state.
We’ll start with the solution and then we’ll go through an explanation. Here’s the query in the Mongo client:
db.zipcodes.aggregate([{"$group":{"_id":"$state", "population":{$sum:"$pop"}}}])
Aggregations are represented by the “aggregate” function that accepts an array of query documents. Each query document represents a stage in the pipeline that in some way transforms the starting collection.
The collection we start with is the entire zipcodes collection of course. MongoDb will go through each document one by one and perform the assigned aggregation stage on them to yield other documents. Here’s part of the outcome of the aggregation:
{ "_id" : "MN", "population" : 4372982 } { "_id" : "SC", "population" : 3486703 } { "_id" : "RI", "population" : 1003218 } { "_id" : "OK", "population" : 3145585 } { "_id" : "MA", "population" : 6016425 } { "_id" : "SD", "population" : 695397 }
So we have one document by state.
The $group operator indicates that we want to group the collection by some key. The $group operator, like all aggregation stage operators, accepts a JSON document with various arguments. The key which we want to group the collection by is indicated by the _id field. It’s a compulsory argument and must be called _id. If you write “id” or “mickeymouse” instead then you’ll get an error:
“errmsg” : “the group aggregate field ‘id’ must be defined as an expression inside an object”
The key will be a field from the source documents and we’ll aggregate by the “state” property. Fields from the source documents must be prefixed with the $ sign, hence the $state input. If you miss out the $ operator it will be understood as a standalone field in the resulting documents:
db.zipcodes.aggregate([{"$group":{"_id":"state", "population":{$sum:"$pop"}}}])
The above query results in the total population in the collection regardless of the state:
{ "_id" : "state", "population" : 248408400 }
The next property in the query document is “population”. We want to add a new field to the resulting documents called “population” and we want it to be the sum of the population.
The $sum operator is equivalent to the SQL COUNT() function, it simply counts stuff. We again refer back to an existing property in the zipcodes collection called $pop.
So we have an aggregation pipeline with only one stage at this point. MongoDb will look at each document in the collection and produce the resulting collections approximately as follows:
- It takes the first document and sees state = MA
- It will try to update the result set of documents by this key
- Since this is the first document with this key MongoDb will insert a new document to the result set with key = MA
- MongoDb sees that it must add the population of the first document to the result document and since that is 0 at this point it simply adds 15338 to 0
- At this point we have the first document with the key – _id – MA and a property called “population” whose value is 15338
- Then comes the next document which might have the state SC and since there’s no result document by that key yet it will be added in the same way as above for “MA”
- Eventually MongoDb will encounter another document with the state “MA”
- Since there’s already a result document with that ID the population of the second document will simply be added to the value in the result set
- …and the process continues until all documents have been processed
MongoDb will therefore perform a series of upserts in this process: INSERT a new document to the result set if there’s no document with a given _id yet or UPDATE in case there’s one. At the end of the process, i.e. when all the zipcode documents have been accounted for we’ll get the resulting set of documents.
The $sum operator can also be used like COUNT(*) in SQL, i.e. just to count the occurrences of a record. E.g. if we want to get the number of zip codes by state we can have the following query:
db.zipcodes.aggregate([{"$group":{"_id":"$state", "count":{$sum: 1}}}])
So we have $sum : 1 instead of $sum : $pop. We don’t refer back to any property in the source documents but simply ask MongoDb to add 1 to the count field while going through the source documents. Here’s part of the result:
{ "_id" : "MN", "count" : 882 } { "_id" : "SC", "count" : 350 } { "_id" : "RI", "count" : 69 } { "_id" : "OK", "count" : 586 } { "_id" : "MA", "count" : 474 }
We’ll keep looking at aggregations in the next post.
You can view all posts related to data storage on this blog here.