Introduction to MongoDb with .NET part 30: starting with indexing
May 19, 2016 Leave a comment
Introduction
In the previous post we finished our discussion of the aggregation framework in MongoDb. We now know the basics of building aggregation pipelines in both the mongo shell using JavaScript with JSON and using the MongoDb .NET driver. In particular we looked at how to write strongly typed aggregation queries with dedicated methods in the aggregation fluent API.
Starting with this post we’ll shift our focus towards a different topic within MongoDb: indexes and related topics in performance.
General remarks
Chances are that you know approximately what indexes do in a database. Full-stack developers normally cannot avoid them while optimising the database tables. MongoDb also offers indexes in much the same way as relational databases do. They offer a way to make searches much quicker by maintaining sort of a sorted register over the documents. Sorting is important as it makes for a very quick and efficient search compared to an unsorted set of documents. If you want to query a collection with a million documents and no indexes then the search engine will need to look at every single document in that collection. If, however, there’s an index in front of the collection then the search engine can quickly find the relevant documents by consulting the index, i.e. the sorted register first.
The kind of index you build will depend on a several parameters:
- Queries: you as the developer will – or at least should – have a good idea about the type of queries hitting the database from the application. If there are a lot of searches on specific fields such as the “name” then it can be a good idea to have an index on that field. Composite indexes, i.e. indexes that comprise multiple keys can handle queries that cover more than one field, such as searches on the “name”, “age” and “hobbies” fields
- Reads vs. writes: indexes don’t come for free. In MongoDb indexes are documents that are updated with every new record in the database. If you build an index on the “name” field then the name index must be updated with every new document of course otherwise they will not show up in the search result. A name index will keep the names in ascending or descending order depending on how the index was specified. This ordering will make searches very efficient. Then if there’s a new document with a new name then this name must be squeezed into the name index. This implies that write operations are slower compared to the situation with no indexes at all. However, reads will be much faster. You’ll need to consider the frequency of reads and writes before you blindly start creating indexes on every single field. This latter strategy is never a good idea by the way.
- Index size: since indexes are also documents they will keep growing with the number of new documents
- Single-key vs. multi-key indexes: as mentioned above it’s possible to build composite indexes. If there’s a frequent search on the name and age fields then it can be a good idea to create a composite index of “name” and “age”. That kind of index will also help with searches on the name field alone. However, if there’s a search solely on the age field then the multi-key index won’t be usable as the primary sorting key is the name. Age is only the secondary sorting key in that case and an age-based query will need to go through all documents
MongoDb collection statistics
Before we actually create indexes I briefly want to mention a utility function that can be called on a collection in the Mongo shell: stats(). Recall that we’ve been working with 2 demo collections in this tutorial: zip codes and restaurants. If you’d like to extract some basic statistics about a collection then you can write something like this in the Mongo shell:
db.zipcodes.stats()
It will return a large JSON document where most of the output is actually related to the storage engine of the database. A storage engine “is the component of the database that is responsible for managing how data is stored, both in memory and on disk. MongoDB supports multiple storage engines, as different engines perform better for specific workloads.“. In other words it is the soul of the database. MongoDb offers a pluggable interface where the storage engine implementation can be replaced in much the same way as we can have multiple implementations of a certain C# interface. If you have the time and knowledge and happen not be satisfied with the query performance then you could develop your own MongoDb storage engine. I guess not too many people out there do that.
As of version 3.2.4 the default storage engine that ships with the standard installation package is called WiredTiger. You’ll see a sub-document called “wiredTiger” in the stats output with a lot of meta-data in it. I think it’s safe to assume that most developers won’t need to know about the details of the storage engine but it’s good to know that it exists and where to find information about it. To be honest I don’t really know what all those properties like “LSM” and “block-manager” mean and I’ve never needed to dig deeper there either. DBAs might want to know more about it though.
The other properties in the JSON document can be more interesting for us:
- “ns” : the collection namespace, “model.zipcodes” for the case of the demo zip codes collection, where “model” is the database name
- “count” : the number of documents in the collection
- “size” : the total space reserved for the collection in bytes
- “avgObjSize” : the average document size which is the total size divided by the number of documents
- “capped” : capped collections “are fixed-size collections. … Capped collections work in a way similar to circular buffers: once a collection fills its allocated space, it makes room for new documents by overwriting the oldest documents in the collection.” By default a collection is not capped, it must be explicitly set in the createCollection function as described here.
The final section of the statistics shows a number of properties related to indexes:
- “nIndexes” : the number of indexes, currently 1. By default there’s always an index on the _id field
- “totalIndexSize” : the total size of all indexes in the database
- “indexSizes” : this sub-document shows the size of each index. Currently there’s only one index on the _id field
In the next post we’ll start looking into index creation in the Mongo shell.
You can view all posts related to data storage on this blog here.