Introduction to MongoDb with .NET part 25: adding multiple stages to the aggregation pipeline

Introduction

In the previous post we continued our discussion of the grouping stage in the MongoDb aggregation pipeline. We looked at a couple of $ operators that can be used in aggregations, such as $min, $avg and $addToSet. We also saw how to use compound IDs to add multiple grouping keys. Compound IDs are also useful when we want to give an explanatory field to the compulsory _id property so that it’s clear what the grouping key is.

In this post we’ll start adding multiple stages to the pipeline. So far we’ve had simple aggregation pipelines with one stage but there’s nothing stopping us from adding more to the stage array. It’s also possible to add multiple instances of the same stage type.

Projections

Projections can reshape documents into various forms. We can add new properties, remove properties that are not needed, transform sub-documents so that they can be aggregated in an easier way. A projection stage is denoted by the $project operator. Keep in mind that the projection won’t modify the source documents at all. It will rather create new documents in-memory and pass them along to the next stage in the pipeline if any.

Let’s start with an easy example and transform the zip code documents a little:

db.zipcodes.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])
  • _id: 0 means that we don’t want to show the _id field in the result set
  • $toLower: this puts every character in a string into lower case. There’s a corresponding $toUpper operator
  • pop: 1 and state: 1 mean that we want to keep those fields from the source documents
  • zip: “$_id” means that we want to rename the _id field into zip

Here’s a sample output:

{ "city" : "agawam", "pop" : 15338, "state" : "MA", "zip" : "01001" }
{ "city" : "cushman", "pop" : 36963, "state" : "MA", "zip" : "01002" }
{ "city" : "belchertown", "pop" : 10579, "state" : "MA", "zip" : "01007" }
{ "city" : "blandford", "pop" : 1240, "state" : "MA", "zip" : "01008" }
{ "city" : "barre", "pop" : 4546, "state" : "MA", "zip" : "01005" }

It’s time to show the first extended pipeline. We’ll add a grouping stage after the projection stage. The important point is that the grouping stage will refer back to the properties in the documents coming from the stage before that, i.e. the projection. Here’s the average population by zip code:

db.zipcodes.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}, {$group : {_id : {"zipcode" : "$zip"}, "average_pop" : {$avg : "$pop"} }}])

Notice how we refer back to the projected zip field in the group stage to build a compound ID and show the meaning of the grouping key.

Filtering

It happens that we don’t want to include all documents in the aggregation. The $match aggregation stage works similarly to how we used the find() function by adding a JSON document as a filter.

The next example will filter out the places in the zipcodes collection whose population is less than 50k:

db.zipcodes.aggregate([{$match:{pop:{$gte:50000}}}])

We’ll also transform those documents:

db.zipcodes.aggregate([{$match:{pop:{$gte:50000}}}, {$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])

…and finally we can group them, meaning we’ll group the places whose population is over 50k and calculate their average population sizes:

db.zipcodes.aggregate([
{$match:{pop:{$gte:50000}}}, 
{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}},
{$group : {_id : {"zipcode" : "$zip"}, "average_pop" : {$avg : "$pop"}}}
])

It can be a bit difficult for the eyes to follow all these curly braces in more complex aggregation queries. These 3 stages are still simple though.

The good thing with this modular setup is that you can build up your aggregation pipeline step by step and view the partial results in the Mongo client.

Sorting

We can also sort the end result using the $sort aggregation stage. Again it’s similar to how we did sorting in the Mongo shell before using 1 for ascending and -1 for descending order:

db.zipcodes.aggregate([
{$match:{pop:{$gte:50000}}}, 
{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}},
{$group : {_id : {"city_name" : "$city"}, "average_pop" : {$avg : "$pop"}}},
{$sort: {"_id.city_name" : 1}}
])

Note how we referred back to the city_name of the compound ID of the grouping stage from the sorting stage using the dot notation we saw before.

Here’s part of the result sorted by city name in ascending order:

{ "_id" : { "city_name" : "aiken" }, "average_pop" : 51233 }
{ "_id" : { "city_name" : "albuquerque" }, "average_pop" : 50233 }
{ "_id" : { "city_name" : "alhambra" }, "average_pop" : 51148 }
{ "_id" : { "city_name" : "allentown" }, "average_pop" : 59074 }
{ "_id" : { "city_name" : "alton" }, "average_pop" : 67604 }

Limiting and skipping

We can of course do limiting and skipping on the result set just like we did when filtering the results in the find function. They are denoted by the $limit and $skip operators and both accept integer values:

db.zipcodes.aggregate([
{$match:{pop:{$gte:50000}}}, 
{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}},
{$group : {_id : {"city_name" : "$city"}, "average_pop" : {$avg : "$pop"}}},
{$sort: {"_id.city_name" : 1}},
{$skip: 10},
{$limit: 5} 
])

It’s the same query as above but we skip the first ten documents and select at most 5. Here’s the result set:

{ "_id" : { "city_name" : "arlington height" }, "average_pop" : 52947 }
{ "_id" : { "city_name" : "aspen hill" }, "average_pop" : 52694 }
{ "_id" : { "city_name" : "atlanta" }, "average_pop" : 53894 }
{ "_id" : { "city_name" : "aurora" }, "average_pop" : 51422 }
{ "_id" : { "city_name" : "azusa" }, "average_pop" : 52261 }

We’ll close this post with a final example that uses the above operators. We want to group all cities in the state of New York and sum up their populations. Then we project the documents into a neater result set, sort the documents by population in descending order, skip the first 10 documents and limit the result set to 5 documents:

db.zipcodes.aggregate([
    {$match:
     {
	 state:"NY"
     }
    },
    {$group:
     {
	 _id: "$city",
	 population: {$sum:"$pop"},
     }
    },
    {$project:
     {
	 _id: 0,
	 city: "$_id",
	 population: 1,
     }
    },
    {$sort:
     {
	 population:-1
     }
    },    
    {$skip: 10},
	{$limit: 5}
])

Here’s the result:

{ "population" : 165629, "city" : "ASTORIA" }
{ "population" : 145967, "city" : "JACKSON HEIGHTS" }
{ "population" : 100646, "city" : "FAR ROCKAWAY" }
{ "population" : 85732, "city" : "RIDGEWOOD" }
{ "population" : 83017, "city" : "BINGHAMTON" }

We’ll look at more aggregation stage types in the next post.

You can view all posts related to data storage on this blog here.

Advertisement

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

2 Responses to Introduction to MongoDb with .NET part 25: adding multiple stages to the aggregation pipeline

  1. John Hurrell says:

    I wanted to offer a correction for the final query. The collection name is listed as “zipcpdes” but should be “zipcpoes”. I’m embarrassed at how long it took me to figure this out.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Elliot Balynn's Blog

A directory of wonderful thoughts

Software Engineering

Web development

Disparate Opinions

Various tidbits

chsakell's Blog

WEB APPLICATION DEVELOPMENT TUTORIALS WITH OPEN-SOURCE PROJECTS

Once Upon a Camayoc

Bite-size insight on Cyber Security for the not too technical.

%d bloggers like this: