Introduction to MongoDb with .NET part 24: aggregations continued
May 6, 2016 Leave a comment
Introduction
In the previous post we looked at our first aggregation example. We grouped the zipcodes collection by state and number of inhabitants in each state. A grouping stage is represented by the $group operator and it’s followed by a number of arguments. We went through the process that MongoDb performs to produce the final set of documents. Essentially it performs a series of upsert operations on the starting collection.
In this post we’ll look at some other query examples.
Calculate the average population by state
The $avg operator instructs MongoDb to calculate the average of a numeric field. The query is almost identical to the $sum query we saw in the previous post:
db.zipcodes.aggregate([{"$group":{"_id":"$state", "population":{$avg:"$pop"}}}])
Here’s part of the output:
{ "_id" : "MN", "population" : 4958.02947845805 } { "_id" : "SC", "population" : 9962.00857142857 } { "_id" : "RI", "population" : 14539.391304347826 } { "_id" : "OK", "population" : 5367.892491467577 } { "_id" : "MA", "population" : 12692.879746835442 } { "_id" : "SD", "population" : 1810.9296875 }
Compound IDs
The _id field in the above output may not be self-explanatory. What is MN and RI and the other codes? For some people it will be obvious, but some other may be left wondering. We can make it clear by assigning a subdocument to the _id field as follows:
db.zipcodes.aggregate([{"$group":{"_id":{"us_state": "$state"}, "population":{$sum:"$pop"}}}])
Here’s part of the output:
{ "_id" : { "us_state" : "MN" }, "population" : 4372982 } { "_id" : { "us_state" : "SC" }, "population" : 3486703 } { "_id" : { "us_state" : "RI" }, "population" : 1003218 } { "_id" : { "us_state" : "OK" }, "population" : 3145585 } { "_id" : { "us_state" : "MA" }, "population" : 6016425 }
We are free to name the key(s) in the _id subdocument in any way we want to. Compound _id fields will also be important for compound grouping stages as well later on in this post.
Calculate min and max
What about the places with minimum and maximum populations by state? Meet the $min and $max operators. Here’s the example for $max:
db.zipcodes.aggregate([{"$group":{"_id":"$state", "max_pop":{$max:"$pop"}}}])
Here’s part of the output:
{ "_id" : "MN", "max_pop" : 51421 } { "_id" : "SC", "max_pop" : 66990 } { "_id" : "RI", "max_pop" : 53733 } { "_id" : "OK", "max_pop" : 45542 } { "_id" : "MA", "max_pop" : 65046 }
There’s nothing stopping us from adding more aggregation operators, like in the following example where we show both the min and max populations:
db.zipcodes.aggregate([{"$group":{"_id":"$state", "max_pop":{$max:"$pop"}, "min_pop": {"$min" : "$pop"}}}])
…which results in the following:
{ "_id" : "NY", "max_pop" : 111396, "min_pop" : 0 } { "_id" : "DE", "max_pop" : 50573, "min_pop" : 108 } { "_id" : "DC", "max_pop" : 62924, "min_pop" : 11 } { "_id" : "OR", "max_pop" : 48007, "min_pop" : 0 } { "_id" : "MD", "max_pop" : 76002, "min_pop" : 1 }
Extract values into arrays
Let’s try something on the restaurants collection this time. We want to find the unique cuisine types by borough, e.g. what types of restaurant Manhattan has to offer. The $addToSet operator comes in handy here which we saw in action before. It adds unique values to an array, Here’s the query:
db.restaurants.aggregate([{"$group":{"_id":"$borough", "cuisine_types":{$addToSet:"$cuisine"}}}])
Here’s part of the output for the borough of Staten Island:
{ "_id" : "Staten Island", "cuisine_types" : [ "Italian", "Ice Cream, Gelato, Yogurt, Ices", "Jewish/Kosher", "Armenian", "Pizza/Italian", "Peruvian", "Fruits/Vegetables", "Delicatessen", "American ", "Russian", "Hamburgers", "Chinese", "Spanish", "Barbecue", "Pizza", "Hotdogs/Pretzels", "Japanese", "Donuts", "Bagels/Pretzels", "Korean", "Indian", "Bakery", "Café/Coffee/Tea", "Juice, Smoothies, Fruit Salads", "German", "Mexican", "Irish", "Chicken", "Sandwiches", "Caribbean", "Cajun", "French", "Mediterranean", "Chinese/Japanese", "Seafood", "Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Pancakes/Waffles", "Hotdogs", "Turkish", "Sandwiches/Salads/Mixed Buffet", "Tex-Mex", "Thai", "Southwestern", "Middle Eastern", "Continental", "Filipino", "Asian", "Bottled beverages, including water, sodas, juices, etc.", "Soups", "Steak", "Other", "Eastern European", "African", "Not Listed/Not Applicable", "Portuguese", "Polish" ] }
Another example is if we want to extract the unique postal codes for each city:
db.zipcodes.aggregate([{"$group":{"_id":"$city", "postal_codes":{$addToSet:"$_id"}}}])
Here’s a short sample output:
{ "_id" : "OTIS", "postal_codes" : [ "01253", "71466", "80743", "97368" ] } { "_id" : "BARRE", "postal_codes" : [ "01005", "05641" ] } { "_id" : "NANTY GLO", "postal_codes" : [ "15943" ] } { "_id" : "BIRCH HARBOR", "postal_codes" : [ "04613" ] } { "_id" : "COEYMANS HOLLOW", "postal_codes" : [ "12046" ] } { "_id" : "HOLYOKE", "postal_codes" : [ "55749", "01040", "80734" ] }
Do you recall the $push operator? It is similar to $addToSet but it also adds non-unique values to arrays. The above example can in fact be rewritten with $push as all zipcode documents have a unique postal code ID so there’s no chance of ending up with duplicated records in the “postal_codes” array. Here’s the solution with $push instead:
db.zipcodes.aggregate([{"$group":{"_id":"$city", "postal_codes":{$push:"$_id"}}}])
It will produce the same set of documents as the $addToSet solution.
Compound aggregations
We can group by multiple items just like in standard SQL. The compound _id technique we saw above is the key to build compound grouping keys. Let’s say we want to group the restaurants by borough and cuisine and see how many restaurants there are for each combination, e.g. how many “Hamburgers” restaurants there are in Queens:
db.restaurants.aggregate([{$group : {_id : {"borough" : "$borough", "cuisine":"$cuisine"}, "restaurants_count" : {"$sum" : 1}}}])
Here’s some sample output:
{ "_id" : { "borough" : "Brooklyn", "cuisine" : "Soups & Sandwiches" }, "restaurants_count" : 5 } { "_id" : { "borough" : "Brooklyn", "cuisine" : "Hamburgers" }, "restaurants_count" : 102 } { "_id" : { "borough" : "Queens", "cuisine" : "Pakistani" }, "restaurants_count" : 11 } { "_id" : { "borough" : "Manhattan", "cuisine" : "Chinese" }, "restaurants_count" : 510 }
So far in this part of the series on aggregations we’ve only seen simple pipelines with a single stage. However, it’s called a pipeline since it can be extended with multiple stages. You’ve probably noticed that the single grouping stage is enclosed in an array that we supply to the aggregate function. That array can be extended of course with other stages.
We’ll continue our discussion with multiple stages and other stage types in the next post.
You can view all posts related to data storage on this blog here.