Introduction to MongoDb with .NET part 24: aggregations continued

Introduction

In the previous post we looked at our first aggregation example. We grouped the zipcodes collection by state and number of inhabitants in each state. A grouping stage is represented by the $group operator and it’s followed by a number of arguments. We went through the process that MongoDb performs to produce the final set of documents. Essentially it performs a series of upsert operations on the starting collection.

In this post we’ll look at some other query examples.

Calculate the average population by state

The $avg operator instructs MongoDb to calculate the average of a numeric field. The query is almost identical to the $sum query we saw in the previous post:

db.zipcodes.aggregate([{"$group":{"_id":"$state", "population":{$avg:"$pop"}}}])

Here’s part of the output:

{ "_id" : "MN", "population" : 4958.02947845805 }
{ "_id" : "SC", "population" : 9962.00857142857 }
{ "_id" : "RI", "population" : 14539.391304347826 }
{ "_id" : "OK", "population" : 5367.892491467577 }
{ "_id" : "MA", "population" : 12692.879746835442 }
{ "_id" : "SD", "population" : 1810.9296875 }

Compound IDs

The _id field in the above output may not be self-explanatory. What is MN and RI and the other codes? For some people it will be obvious, but some other may be left wondering. We can make it clear by assigning a subdocument to the _id field as follows:

db.zipcodes.aggregate([{"$group":{"_id":{"us_state": "$state"}, "population":{$sum:"$pop"}}}])

Here’s part of the output:

{ "_id" : { "us_state" : "MN" }, "population" : 4372982 }
{ "_id" : { "us_state" : "SC" }, "population" : 3486703 }
{ "_id" : { "us_state" : "RI" }, "population" : 1003218 }
{ "_id" : { "us_state" : "OK" }, "population" : 3145585 }
{ "_id" : { "us_state" : "MA" }, "population" : 6016425 }

We are free to name the key(s) in the _id subdocument in any way we want to. Compound _id fields will also be important for compound grouping stages as well later on in this post.

Calculate min and max

What about the places with minimum and maximum populations by state? Meet the $min and $max operators. Here’s the example for $max:

db.zipcodes.aggregate([{"$group":{"_id":"$state", "max_pop":{$max:"$pop"}}}])

Here’s part of the output:

{ "_id" : "MN", "max_pop" : 51421 }
{ "_id" : "SC", "max_pop" : 66990 }
{ "_id" : "RI", "max_pop" : 53733 }
{ "_id" : "OK", "max_pop" : 45542 }
{ "_id" : "MA", "max_pop" : 65046 }

There’s nothing stopping us from adding more aggregation operators, like in the following example where we show both the min and max populations:

db.zipcodes.aggregate([{"$group":{"_id":"$state", "max_pop":{$max:"$pop"}, "min_pop": {"$min" : "$pop"}}}])

…which results in the following:

{ "_id" : "NY", "max_pop" : 111396, "min_pop" : 0 }
{ "_id" : "DE", "max_pop" : 50573, "min_pop" : 108 }
{ "_id" : "DC", "max_pop" : 62924, "min_pop" : 11 }
{ "_id" : "OR", "max_pop" : 48007, "min_pop" : 0 }
{ "_id" : "MD", "max_pop" : 76002, "min_pop" : 1 }

Extract values into arrays

Let’s try something on the restaurants collection this time. We want to find the unique cuisine types by borough, e.g. what types of restaurant Manhattan has to offer. The $addToSet operator comes in handy here which we saw in action before. It adds unique values to an array, Here’s the query:

db.restaurants.aggregate([{"$group":{"_id":"$borough", "cuisine_types":{$addToSet:"$cuisine"}}}])

Here’s part of the output for the borough of Staten Island:

{ "_id" : "Staten Island", "cuisine_types" : [ "Italian", "Ice Cream, Gelato, Yogurt, Ices", "Jewish/Kosher", "Armenian", "Pizza/Italian", "Peruvian", "Fruits/Vegetables", "Delicatessen", "American ", "Russian", "Hamburgers", "Chinese", "Spanish", "Barbecue", "Pizza", "Hotdogs/Pretzels", "Japanese", "Donuts", "Bagels/Pretzels", "Korean", "Indian", "Bakery", "Café/Coffee/Tea", "Juice, Smoothies, Fruit Salads", "German", "Mexican", "Irish", "Chicken", "Sandwiches", "Caribbean", "Cajun", "French", "Mediterranean", "Chinese/Japanese", "Seafood", "Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Pancakes/Waffles", "Hotdogs", "Turkish", "Sandwiches/Salads/Mixed Buffet", "Tex-Mex", "Thai", "Southwestern", "Middle Eastern", "Continental", "Filipino", "Asian", "Bottled beverages, including water, sodas, juices, etc.", "Soups", "Steak", "Other", "Eastern European", "African", "Not Listed/Not Applicable", "Portuguese", "Polish" ] }

Another example is if we want to extract the unique postal codes for each city:

db.zipcodes.aggregate([{"$group":{"_id":"$city", "postal_codes":{$addToSet:"$_id"}}}])

Here’s a short sample output:

{ "_id" : "OTIS", "postal_codes" : [ "01253", "71466", "80743", "97368" ] }
{ "_id" : "BARRE", "postal_codes" : [ "01005", "05641" ] }
{ "_id" : "NANTY GLO", "postal_codes" : [ "15943" ] }
{ "_id" : "BIRCH HARBOR", "postal_codes" : [ "04613" ] }
{ "_id" : "COEYMANS HOLLOW", "postal_codes" : [ "12046" ] }
{ "_id" : "HOLYOKE", "postal_codes" : [ "55749", "01040", "80734" ] }

Do you recall the $push operator? It is similar to $addToSet but it also adds non-unique values to arrays. The above example can in fact be rewritten with $push as all zipcode documents have a unique postal code ID so there’s no chance of ending up with duplicated records in the “postal_codes” array. Here’s the solution with $push instead:

db.zipcodes.aggregate([{"$group":{"_id":"$city", "postal_codes":{$push:"$_id"}}}])

It will produce the same set of documents as the $addToSet solution.

Compound aggregations

We can group by multiple items just like in standard SQL. The compound _id technique we saw above is the key to build compound grouping keys. Let’s say we want to group the restaurants by borough and cuisine and see how many restaurants there are for each combination, e.g. how many “Hamburgers” restaurants there are in Queens:

db.restaurants.aggregate([{$group : {_id : {"borough" : "$borough", "cuisine":"$cuisine"}, "restaurants_count" : {"$sum" : 1}}}])

Here’s some sample output:

{ "_id" : { "borough" : "Brooklyn", "cuisine" : "Soups & Sandwiches" }, "restaurants_count" : 5 }
{ "_id" : { "borough" : "Brooklyn", "cuisine" : "Hamburgers" }, "restaurants_count" : 102 }
{ "_id" : { "borough" : "Queens", "cuisine" : "Pakistani" }, "restaurants_count" : 11 }
{ "_id" : { "borough" : "Manhattan", "cuisine" : "Chinese" }, "restaurants_count" : 510 }

So far in this part of the series on aggregations we’ve only seen simple pipelines with a single stage. However, it’s called a pipeline since it can be extended with multiple stages. You’ve probably noticed that the single grouping stage is enclosed in an array that we supply to the aggregate function. That array can be extended of course with other stages.

We’ll continue our discussion with multiple stages and other stage types in the next post.

You can view all posts related to data storage on this blog here.

Advertisements

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

ultimatemindsettoday

A great WordPress.com site

iReadable { }

.NET Tips & Tricks

Robin Sedlaczek's Blog

Developer on Microsoft Technologies

HarsH ReaLiTy

A Good Blog is Hard to Find

Softwarearchitektur in der Praxis

Wissenswertes zu Webentwicklung, Domain-Driven Design und Microservices

the software architecture

thoughts, ideas, diagrams,enterprise code, design pattern , solution designs

Technology Talks

on Microsoft technologies, Web, Android and others

Software Engineering

Web development

Disparate Opinions

Various tidbits

chsakell's Blog

Anything around ASP.NET MVC,WEB API, WCF, Entity Framework & AngularJS

Cyber Matters

Bite-size insight on Cyber Security for the not too technical.

Guru N Guns's

OneSolution To dOTnET.

Johnny Zraiby

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.

%d bloggers like this: