Introduction to MongoDb with .NET part 25: adding multiple stages to the aggregation pipeline
May 9, 2016 2 Comments
Introduction
In the previous post we continued our discussion of the grouping stage in the MongoDb aggregation pipeline. We looked at a couple of $ operators that can be used in aggregations, such as $min, $avg and $addToSet. We also saw how to use compound IDs to add multiple grouping keys. Compound IDs are also useful when we want to give an explanatory field to the compulsory _id property so that it’s clear what the grouping key is.
In this post we’ll start adding multiple stages to the pipeline. So far we’ve had simple aggregation pipelines with one stage but there’s nothing stopping us from adding more to the stage array. It’s also possible to add multiple instances of the same stage type.
Projections
Projections can reshape documents into various forms. We can add new properties, remove properties that are not needed, transform sub-documents so that they can be aggregated in an easier way. A projection stage is denoted by the $project operator. Keep in mind that the projection won’t modify the source documents at all. It will rather create new documents in-memory and pass them along to the next stage in the pipeline if any.
Let’s start with an easy example and transform the zip code documents a little:
db.zipcodes.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])
- _id: 0 means that we don’t want to show the _id field in the result set
- $toLower: this puts every character in a string into lower case. There’s a corresponding $toUpper operator
- pop: 1 and state: 1 mean that we want to keep those fields from the source documents
- zip: “$_id” means that we want to rename the _id field into zip
Here’s a sample output:
{ "city" : "agawam", "pop" : 15338, "state" : "MA", "zip" : "01001" } { "city" : "cushman", "pop" : 36963, "state" : "MA", "zip" : "01002" } { "city" : "belchertown", "pop" : 10579, "state" : "MA", "zip" : "01007" } { "city" : "blandford", "pop" : 1240, "state" : "MA", "zip" : "01008" } { "city" : "barre", "pop" : 4546, "state" : "MA", "zip" : "01005" }
It’s time to show the first extended pipeline. We’ll add a grouping stage after the projection stage. The important point is that the grouping stage will refer back to the properties in the documents coming from the stage before that, i.e. the projection. Here’s the average population by zip code:
db.zipcodes.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}, {$group : {_id : {"zipcode" : "$zip"}, "average_pop" : {$avg : "$pop"} }}])
Notice how we refer back to the projected zip field in the group stage to build a compound ID and show the meaning of the grouping key.
Filtering
It happens that we don’t want to include all documents in the aggregation. The $match aggregation stage works similarly to how we used the find() function by adding a JSON document as a filter.
The next example will filter out the places in the zipcodes collection whose population is less than 50k:
db.zipcodes.aggregate([{$match:{pop:{$gte:50000}}}])
We’ll also transform those documents:
db.zipcodes.aggregate([{$match:{pop:{$gte:50000}}}, {$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])
…and finally we can group them, meaning we’ll group the places whose population is over 50k and calculate their average population sizes:
db.zipcodes.aggregate([ {$match:{pop:{$gte:50000}}}, {$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}, {$group : {_id : {"zipcode" : "$zip"}, "average_pop" : {$avg : "$pop"}}} ])
It can be a bit difficult for the eyes to follow all these curly braces in more complex aggregation queries. These 3 stages are still simple though.
The good thing with this modular setup is that you can build up your aggregation pipeline step by step and view the partial results in the Mongo client.
Sorting
We can also sort the end result using the $sort aggregation stage. Again it’s similar to how we did sorting in the Mongo shell before using 1 for ascending and -1 for descending order:
db.zipcodes.aggregate([ {$match:{pop:{$gte:50000}}}, {$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}, {$group : {_id : {"city_name" : "$city"}, "average_pop" : {$avg : "$pop"}}}, {$sort: {"_id.city_name" : 1}} ])
Note how we referred back to the city_name of the compound ID of the grouping stage from the sorting stage using the dot notation we saw before.
Here’s part of the result sorted by city name in ascending order:
{ "_id" : { "city_name" : "aiken" }, "average_pop" : 51233 } { "_id" : { "city_name" : "albuquerque" }, "average_pop" : 50233 } { "_id" : { "city_name" : "alhambra" }, "average_pop" : 51148 } { "_id" : { "city_name" : "allentown" }, "average_pop" : 59074 } { "_id" : { "city_name" : "alton" }, "average_pop" : 67604 }
Limiting and skipping
We can of course do limiting and skipping on the result set just like we did when filtering the results in the find function. They are denoted by the $limit and $skip operators and both accept integer values:
db.zipcodes.aggregate([ {$match:{pop:{$gte:50000}}}, {$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}, {$group : {_id : {"city_name" : "$city"}, "average_pop" : {$avg : "$pop"}}}, {$sort: {"_id.city_name" : 1}}, {$skip: 10}, {$limit: 5} ])
It’s the same query as above but we skip the first ten documents and select at most 5. Here’s the result set:
{ "_id" : { "city_name" : "arlington height" }, "average_pop" : 52947 } { "_id" : { "city_name" : "aspen hill" }, "average_pop" : 52694 } { "_id" : { "city_name" : "atlanta" }, "average_pop" : 53894 } { "_id" : { "city_name" : "aurora" }, "average_pop" : 51422 } { "_id" : { "city_name" : "azusa" }, "average_pop" : 52261 }
We’ll close this post with a final example that uses the above operators. We want to group all cities in the state of New York and sum up their populations. Then we project the documents into a neater result set, sort the documents by population in descending order, skip the first 10 documents and limit the result set to 5 documents:
db.zipcodes.aggregate([ {$match: { state:"NY" } }, {$group: { _id: "$city", population: {$sum:"$pop"}, } }, {$project: { _id: 0, city: "$_id", population: 1, } }, {$sort: { population:-1 } }, {$skip: 10}, {$limit: 5} ])
Here’s the result:
{ "population" : 165629, "city" : "ASTORIA" } { "population" : 145967, "city" : "JACKSON HEIGHTS" } { "population" : 100646, "city" : "FAR ROCKAWAY" } { "population" : 85732, "city" : "RIDGEWOOD" } { "population" : 83017, "city" : "BINGHAMTON" }
We’ll look at more aggregation stage types in the next post.
You can view all posts related to data storage on this blog here.
I wanted to offer a correction for the final query. The collection name is listed as “zipcpdes” but should be “zipcpoes”. I’m embarrassed at how long it took me to figure this out.
Thanks for the correction, it should be “zipcodes”.
//Andras