Introduction to MongoDb with .NET part 22: starting with aggregations
May 4, 2016
In the previous post we looked at bulk writes in the MongoDb .NET driver. We saw how the various write models allowed us to package a number of insert, delete and update operations into one bulk write. The operations in the group are sent to the database together to be executed in one go.
We have therefore finished the discussion of the most important CRUD operations in this series. We are now familiar with the basics of SELECT, UPDATE, INSERT and DELETE both in the Mongo shell and the .NET driver. In this post we’ll start looking into a very different form of querying, namely aggregations. Aggregations serve a similar purpose to grouping techniques in standard SQL, such as the GROUP BY clause, but both the concept and the query syntax are quite different in MongoDb.
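To give a feel for the analogy, here is a sketch of a SQL GROUP BY query next to a roughly equivalent MongoDb pipeline in Mongo shell syntax. The orders collection and its fields are hypothetical, just for illustration; real examples follow later in the series:

```javascript
// SQL equivalent:
//   SELECT customerId, COUNT(*) AS orderCount
//   FROM orders
//   GROUP BY customerId;

// A MongoDb aggregation pipeline is simply an array of stage documents:
const pipeline = [
  { $group: { _id: "$customerId", orderCount: { $sum: 1 } } }
];

// In the Mongo shell you would run it as:
//   db.orders.aggregate(pipeline)
console.log(JSON.stringify(pipeline));
```

Note that the grouping key appears as the special _id field of the $group stage and the field reference "$customerId" is prefixed with a ‘$’ sign.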
Aggregations in general
What are aggregations? What can we do with them? Aggregations are helpful when analysing the data in the database. After all, the data stored in various collections is not too useful in itself. It’s rather the information and conclusions that you can extract from this data that have the real value: they help you decide what actions to take in the future. Aggregations are very important in data mining, business intelligence (BI) and data science, which are all very hot topics nowadays.
A related term that often comes up is MapReduce, or Map/Filter/Reduce. Aggregations and MapReduce are similar ideas and MongoDb supports both of them. In this series I’ll stick to aggregations, simply because I’m more familiar with them in MongoDb and they tend to be faster and more efficient than the equivalent MapReduce functions. In general we can express the same type of query with both aggregations and MapReduce functions. You can find more information and code examples on how MapReduce is implemented in MongoDb here. Real-life analysis queries can become quite complex, a lot more so than average SELECT statements.
Short summary of MapReduce
MapReduce is widely used in data mining and big data applications to extract information from a large, potentially unstructured data set. E.g. finding the average age of all employees who have been employed for more than 5 years is a good candidate for this algorithm.
The individual parts of Map/Filter/Reduce, i.e. Map, Filter and Reduce, are steps or operations in a chain that computes something from a collection. Not all three steps are required in every data mining case. Basic examples:
- Finding the average age of employees who have been working at a company for more than 5 years: you map the age property of each employee to a list of integers but filter out those who have been working for less than 5 years. Then you calculate the average of the elements in the integer list, i.e. you reduce the list to a single outcome.
- Finding the IDs of every employee: if the IDs are strings then you can map the ID fields into a list of strings; there’s no need for any filtering or reducing.
- Finding the average age of all employees: you map the age of each employee into an integer list and then calculate the average of those integers in the reduce phase; there’s no need for filtering.
- Finding all employees over 50 years of age: you filter out the employees who are younger than 50; there’s no need for mapping or reducing the employees collection.
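The examples above map directly onto JavaScript’s array operations, which makes for a compact sketch of the three phases. The employee data here is made up purely for illustration:

```javascript
// Hypothetical in-memory employee data for illustration.
const employees = [
  { id: "e1", age: 52, yearsEmployed: 7 },
  { id: "e2", age: 34, yearsEmployed: 3 },
  { id: "e3", age: 61, yearsEmployed: 12 }
];

// 1) Average age of employees employed for more than 5 years:
//    filter, then map, then reduce to a single number.
const longTermAges = employees
  .filter(e => e.yearsEmployed > 5) // filter phase
  .map(e => e.age);                 // map phase
const avgLongTermAge =
  longTermAges.reduce((sum, a) => sum + a, 0) / longTermAges.length; // reduce phase

// 2) All employee IDs: map only.
const ids = employees.map(e => e.id);

// 3) All employees over 50: filter only.
const over50 = employees.filter(e => e.age > 50);

console.log(avgLongTermAge, ids, over50.length);
```

The same shape carries over to any MapReduce implementation: the interesting design decisions are which phases you need and in what order you chain them.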
MapReduce implementations in reality can become quite complex depending on the query and structure of the source data.
The aggregation pipeline
A set of documents to be analysed – aggregated – will often need to go through various stages before we get the required result, which can be a single document or a set of documents. MongoDb has numerous aggregation stages, all denoted by special ‘$’ keywords; we’ve seen ‘$’ operators before, such as $set and $not. A dedicated group of ‘$’ operators denotes the stages within the aggregation pipeline, such as $match, $group and $project.
We can take our digestive system as a pipeline example. Food in itself is not too useful when your body needs energy. However, if you eat that food then it will go through various stages in your body as soon as you start chewing it. At the end it will end up in various forms including minerals, fibres, vitamins etc. that the body cells need to function properly.
The MongoDb aggregation pipeline also has one or more stages in it. A set of documents will enter the beginning of the pipeline and the documents will change shape as they are passed from one stage to the next in the pipeline. At the end of the pipeline you’ll get the final result.
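As a quick sketch of such a pipeline in Mongo shell syntax – the employees collection and its fields are hypothetical, and we’ll build real examples in the upcoming posts – consider computing the average age of long-term employees in two stages:

```javascript
// A two-stage pipeline: $match narrows the set of documents entering
// the pipeline, $group then reshapes them into an aggregated result.
const pipeline = [
  { $match: { yearsEmployed: { $gt: 5 } } },          // stage 1: filter the inputs
  { $group: { _id: null, avgAge: { $avg: "$age" } } } // stage 2: aggregate into one document
];

// In the Mongo shell this would be run as:
//   db.employees.aggregate(pipeline)
console.log(JSON.stringify(pipeline));
```

Each stage receives the documents emitted by the previous one, so the $group stage only ever sees the documents that survived the $match stage.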
We’ll look at a number of aggregation stages and examples in this part of the series. Read the next post here where we go through the first aggregation example.
You can view all posts related to data storage on this blog here.