Introduction to MongoDb with .NET part 26: more aggregation types and examples
May 10, 2016 Leave a comment
Introduction
In the previous post we looked at how to add multiple stages to a MongoDb aggregation pipeline. We saw that it was easy and straightforward to extend the stages array of the pipeline. It’s important to remember that each stage acts upon the set of documents produced by the previous stage. We can also have the same stage type appear more than once in the pipeline, i.e. there’s no limit saying we can only have one $group operation in a pipeline. The various stages can also appear in various order, we don’t have to follow some type of ordering either.
In this post we’ll continue where we left off and look at a couple more aggregation operators and examples.
First and last
We now know how to sort the result set of documents using the $sort operator. An extra step after sorting can be to select the first or last document of the documents. This is useful for queries such as “what is the largest city in Massachusetts” or “which student has the worst grades in maths”.
Let’s try to find the largest city in each state from the zipcodes collection. Bear in mind that the same city can appear multiple times in the zipcodes collection since a city can have multiple zip codes. Hence the first stage will be grouping by state and city where we sum up the population of each city:
db.zipcodes.aggregate([ {$group : { "_id" : {"state" : "$state", "city" : "$city"}, "population" : {$sum : "$pop"}}} ])
Next we sort the set of documents by state and population:
db.zipcodes.aggregate([ {$group : { "_id" : {"state" : "$state", "city" : "$city"}, "population" : {$sum : "$pop"}}}, {$sort: {"_id.state" : 1, "population" : -1}} ])
We can now select the first document for each state in a second grouping stage. We’ll select the first document by city and population:
db.zipcodes.aggregate([ {$group : { "_id" : {"state" : "$state", "city" : "$city"}, "population" : {$sum : "$pop"}}}, {$sort: {"_id.state" : 1, "population" : -1}}, {$group: {"_id" : {"state" : "$_id.state"}, "largest_city" : {$first: "$_id.city"}, "population" : {$first: "$population"}}} ])
Here’s a sample set of documents to show where we are now:
{ "_id" : { "state" : "PA" }, "largest_city" : "PHILADELPHIA", "population" : 1610956 } { "_id" : { "state" : "CO" }, "largest_city" : "DENVER", "population" : 451182 } { "_id" : { "state" : "OH" }, "largest_city" : "CLEVELAND", "population" : 536759 } { "_id" : { "state" : "AZ" }, "largest_city" : "PHOENIX", "population" : 890853 } { "_id" : { "state" : "KY" }, "largest_city" : "LOUISVILLE", "population" : 288058 }
The second grouping stage seems to have messed up the sorting of the first sort stage so let’s add another one and sort the documents by state:
db.zipcodes.aggregate([ {$group : { "_id" : {"state" : "$state", "city" : "$city"}, "population" : {$sum : "$pop"}}}, {$sort: {"_id.state" : 1, "population" : -1}}, {$group: {"_id" : {"state" : "$_id.state"}, "largest_city" : {$first: "$_id.city"}, "population" : {$first: "$population"}}}, {$sort : {"_id.state" : 1}} ])
It’s now sorted by state.
We can also introduce a projection stage at the end just to make the documents neater. We get rid of the _id field and replace it with a “state” field instead:
db.zipcodes.aggregate([ {$group : { "_id" : {"state" : "$state", "city" : "$city"}, "population" : {$sum : "$pop"}}}, {$sort: {"_id.state" : 1, "population" : -1}}, {$group: {"_id" : {"state" : "$_id.state"}, "largest_city" : {$first: "$_id.city"}, "population" : {$first: "$population"}}}, {$sort : {"_id.state" : 1}}, {$project : {"_id" : 0, "state" : "$_id.state" ,"largest_city" : 1, "population": 1}} ])
Here’s a sample result set:
{ "largest_city" : "ANCHORAGE", "population" : 183987, "state" : "AK" } { "largest_city" : "BIRMINGHAM", "population" : 242606, "state" : "AL" } { "largest_city" : "LITTLE ROCK", "population" : 192895, "state" : "AR" } { "largest_city" : "PHOENIX", "population" : 890853, "state" : "AZ" } { "largest_city" : "LOS ANGELES", "population" : 2102295, "state" : "CA" }
Array unwinding
It can be difficult to aggregate the documents by values in an array. Consider the following people collection:
{ "_id" : ObjectId("572d6f1e7651d0f43f521054"), "name" : "susan", "sports" : [ "badminton", "squash" ] } { "_id" : ObjectId("572d6f347651d0f43f521055"), "name" : "paul", "sports" : [ "badminton", "squash", "football" ] } { "_id" : ObjectId("572d6f4c7651d0f43f521056"), "name" : "jane", "sports" : [ "squash", "baseball" ] } { "_id" : ObjectId("572d6f737651d0f43f521057"), "name" : "rick", "sports" : [ "baseball", "tennis", "basketball" ] }
It’s difficult to perform an aggregation on the values in the sports array like that. The $unwind operator breaks out those values:
db.people.aggregate([ {"$unwind" : "$sports"} ])
…and here’s what the result set looks like:
{ "_id" : ObjectId("572d6f1e7651d0f43f521054"), "name" : "susan", "sports" : "badminton" } { "_id" : ObjectId("572d6f1e7651d0f43f521054"), "name" : "susan", "sports" : "squash" } { "_id" : ObjectId("572d6f347651d0f43f521055"), "name" : "paul", "sports" : "badminton" } { "_id" : ObjectId("572d6f347651d0f43f521055"), "name" : "paul", "sports" : "squash" } { "_id" : ObjectId("572d6f347651d0f43f521055"), "name" : "paul", "sports" : "football" } { "_id" : ObjectId("572d6f4c7651d0f43f521056"), "name" : "jane", "sports" : "squash" } { "_id" : ObjectId("572d6f4c7651d0f43f521056"), "name" : "jane", "sports" : "baseball" } { "_id" : ObjectId("572d6f737651d0f43f521057"), "name" : "rick", "sports" : "baseball" } { "_id" : ObjectId("572d6f737651d0f43f521057"), "name" : "rick", "sports" : "tennis" } { "_id" : ObjectId("572d6f737651d0f43f521057"), "name" : "rick", "sports" : "basketball" }
So for each original document there will be as many resulting documents as there are entries in the sports array. Susan has now 2 documents, Paul 3 and so on. We can now count the frequency of each sport in a grouping stage:
db.people.aggregate([ {"$unwind" : "$sports"}, {"$group" : {"_id" : "$sports", "count" : {$sum: 1}}} ])
Here’s the result:
{ "_id" : "tennis", "count" : 1 } { "_id" : "badminton", "count" : 2 } { "_id" : "squash", "count" : 3 } { "_id" : "football", "count" : 1 } { "_id" : "baseball", "count" : 2 } { "_id" : "basketball", "count" : 1 }
$unwind was the last aggregation operator that we’ll handle in this introduction to MongoDb aggregation.
The following documentation pages will describe all aggregation types and operators:
- https://docs.mongodb.com/manual/aggregation/
- https://docs.mongodb.com/manual/reference/operator/aggregation/
- SQL to MongoDb aggregation mapping chart
We’ll wrap up with two other examples.
Average population of filtered cities and selected states
First off we want to calculate the average population of those cities in California and New York whose population is over and above 25000.
We’ll start with the matching stage where we only want to include the places in CA and NY:
db.zipcodes.aggregate([ {$match: { "state" : {$in : ["CA", "NY"]} } } ])
Next we sum up the population of each city:
db.zipcodes.aggregate([ {$match: { "state" : {$in : ["CA", "NY"]} } }, {"$group": { "_id":"$city", "population":{$sum:"$pop"} } } ])
The above step is necessary since the same city can have multiple zip codes so they can figure 2 or more times in the collection.
The next step is again a matching step where we only want to include the cities with populations over 25000:
db.zipcodes.aggregate([ {$match: { "state" : {$in : ["CA", "NY"]} } }, {"$group": { "_id":"$city", "population":{$sum:"$pop"} } }, {$match: { "population" : {$gt : 25000} } } ])
We’re now ready to calculate the overall average in the group stage. We don’t want to have any grouping ID, we want a single number. A little trick here is to have a null value for the _id field. We’ll also add a final project stage where we suppress the null _id field and output the average population:
db.zipcodes.aggregate([ {$match: { "state" : {$in : ["CA", "NY"]} } }, {"$group": { "_id":"$city", "population":{$sum:"$pop"} } }, {$match: { "population" : {$gt : 25000} } }, {$group: { "_id" : null, "pop" : {$avg: "$population"} } }, {$project: { "_id" : 0, "average_population" : "$pop" } } ])
Here’s the response:
{ "average_population" : 83199.93478260869 }
Calculate total population of cities with A, B and C
We want to calculate the average population of those cities whose names start with A, B or C. Here’s the complete query:
db.zipcodes.aggregate([ {$project: { "city" : "$city", "population" : "$pop", "first_char": {$substr : ["$city",0,1]}, } }, {$match: { "first_char" : {$in : ["A", "B", "C"]} } }, {$group: { "_id" : null, "sum_population" : {$sum: "$population"} } } ])
The $substr operator is new. It extracts a substring from a source string. In this case we are interested in the first character in each city name. We then have the $in operator to include only those documents where the first_char property is either A, B or C. Finally we calculate the sum of their populations. The answer is 55582457.
In the next post we’ll look at how aggregations can be written in the .NET driver.
You can view all posts related to data storage on this blog here.