Introduction to MongoDb with .NET part 32: compound indexes

Introduction

In the previous post we created our first index on the zip codes collection using the Mongo shell. We inserted a text on the state property. We also got introduced to the explain function which presents a query plan with or without executing the query depending on the boolean parameter to the function. It’s a good strategy to run the explain function before and after the index creation. Your goal should be to reduce the amount of documents scanned as much as possible.

In this post we’ll take a look at compound indexes.

Compound indexes

A compound index is one which consists of two or more properties of a collection. They are useful if a frequent query involves multiple filters. We can take an example from the restaurants collection. Here’s an example document:

db.restaurants.findOne()
{
        "_id" : ObjectId("56edc2ff03a1cd840734dba8"),
        "address" : {
                "building" : "2780",
                "coord" : [
                        -73.98241999999999,
                        40.579505
                ],
                "street" : "Stillwell Avenue",
                "zipcode" : "11224"
        },
        "borough" : "Brooklyn",
        "cuisine" : "American ",
        "grades" : [
                {
                        "date" : ISODate("2014-06-10T00:00:00Z"),
                        "grade" : "A",
                        "score" : 5
                },
                {
                        "date" : ISODate("2013-06-05T00:00:00Z"),
                        "grade" : "A",
                        "score" : 7
                },
                {
                        "date" : ISODate("2012-04-13T00:00:00Z"),
                        "grade" : "A",
                        "score" : 12
                },
                {
                        "date" : ISODate("2011-10-12T00:00:00Z"),
                        "grade" : "A",
                        "score" : 12
                }
        ],
        "name" : "Riviera Caterer",
        "restaurant_id" : "40356018"
}

We can imagine that a frequent query would filter on the borough and cuisine properties. E.g. give me all Thai restaurants in Manhattan:

db.restaurants.find({"borough" : "Manhattan", "cuisine" : "Thai"})

The explain command…

db.restaurants.explain(true).find({"borough" : "Manhattan", "cuisine" : "Thai"})

…will tell us the following:

  • In the absence of any indexes on the borough or cuisine fields MongoDb performs a complete query scan – recall the COLLSCAN stage in the winning plan sub-document
  • 154 documents are returned and all 25362 documents in the collection were scanned

Let’s try to improve this. We’ll first see if a single index on the borough field helps. Insert the following index:

db.restaurants.createIndex({"borough" : 1})

We didn’t provide any name for the index. MongoDb will give it a default name of (property_name)_(1 or -1 whether ascending or descending), e.g.:

borough_1
state_-1

Running the same explain command gives us the following:

  • An index scan was performed with the borough_1 index
  • The index scan returned 10259 documents which is the number of restaurants in Manhattan in the collection
  • The total number of scanned documents were therefore 10259 which is at least some improvement compared to what we had before

Here are the relevant bits of the query plan to support the above – I’ve removed the “noise”:

 "winningPlan" : {
                        "stage" : "FETCH",
                        "filter" : {
                                "cuisine" : {
                                        "$eq" : "Thai"
                                }
                        },
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                        "borough" : 1
                                },
                                "indexName" : "borough_1",

...

 "executionStats" : {                
                "nReturned" : 154,
                
                "totalDocsExamined" : 10259,
                "executionStages" : {
                        "stage" : "FETCH",
                        "filter" : {
                                "cuisine" : {
                                        "$eq" : "Thai"
                                }
                        },
                        "nReturned" : 154,                        
                        "docsExamined" : 10259                        
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "nReturned" : 10259,
                                
                                "indexName" : "borough_1",
                                
        },

We have therefore 2 stages in total. First the index scanning stage which returns 10259 documents, i.e. the restaurants in Manhattan. Then we have a FETCH stage where all 10259 documents are scanned and filtered on the cuisine field.

Let’s see if a compound index can further improve the results. First we’ll remove the index with the dropIndex function. The dropIndex function accepts the same JSON parameter with which we created the index:

db.restaurants.dropIndex({"borough" : 1})

The following command will create a compound index on the borough and the cuisine fields:

db.restaurants.createIndex({"borough" : 1, "cuisine" : 1})

Running the explain function now will tell us the following:

  • The compound index was indeed used to execute the query
  • The index scan stage returned 154 documents
  • The total number of documents examined is 154 which also equals the number of documents the query returns
  • The total execution time was 1 millisecond

Here are some relevant details from the query plan:

 "winningPlan" : {
                        "stage" : "FETCH",
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                        "borough" : 1,
                                        "cuisine" : 1
                                },
                                "indexName" : "borough_1_cuisine_1",

...

 "executionStats" : {                
                "nReturned" : 154,
                "executionTimeMillis" : 1,
                "totalKeysExamined" : 154,
                "totalDocsExamined" : 154,
                "executionStages" : {
                        "stage" : "FETCH",
                        "nReturned" : 154,                        
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "nReturned" : 154,
                                "executionTimeMillisEstimate" : 0,                                
                                "indexName" : "borough_1_cuisine_1",

We’re certainly better off compared to the state where we had no indexes at all.

Can the above index help us with a filter on the borough property only? Let’s try the following:

db.restaurants.explain(true).find({"borough" : "Manhattan"})

Yes, it definitely helps. The total number of documents returned is 10259 which is also the number of scanned documents. The query plan shows that the compound index was used to accelerate the query execution.

What about a query on the cuisine field only? Let’s see:

db.restaurants.explain(true).find({"cuisine" : "Thai"})

No, our compound index doesn’t help at all:

 "winningPlan" : {
                        "stage" : "COLLSCAN",

...

 "executionStats" : {
                "executionSuccess" : true,
                "nReturned" : 288,
                "executionTimeMillis" : 13,
                "totalKeysExamined" : 0,
                "totalDocsExamined" : 25362,
                "executionStages" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                                "cuisine" : {
                                        "$eq" : "Thai"
                                }
                        },
                        "nReturned" : 288,

All documents in the collection had to be scanned.

This is logical if you think about it. Let’s say we have a telephone book from all English speaking countries, i.e. one for Canada, one for the US etc. According to Wikipedia there are 67 sovereign states and 27 other entities that fit this criteria, so we have 94 telephone books. If we want to find all people with the last name Jones that live in the UK then we can consult the UK telephone book and ignore the rest. However, if we want to find everyone with the last name Jones then we have to scan all 94 telephone books to collect all the relevant “documents”.

The telephone books represent a compound index of country and last name, in that order. The above analysis suggests that we have to go from left to right when assessing the usability of a compound index for a specific search. The first member of a compound index is the main index key, then comes the secondary key, followed by the tertiary and so on. If we have a compound index on A, B and C then the following queries will use the index:

  • Filter on A
  • Filter on A and B
  • Filter on A, B and C

Other filters, such as the ones on B or B and C will need to scan all the documents in the collection.

We’ll look at indexing arrays in the next post.

You can view all posts related to data storage on this blog here.

Advertisements

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

ultimatemindsettoday

A great WordPress.com site

iReadable { }

.NET Tips & Tricks

Robin Sedlaczek's Blog

Developer on Microsoft Technologies

HarsH ReaLiTy

A Good Blog is Hard to Find

Softwarearchitektur in der Praxis

Wissenswertes zu Webentwicklung, Domain-Driven Design und Microservices

the software architecture

thoughts, ideas, diagrams,enterprise code, design pattern , solution designs

Technology Talks

on Microsoft technologies, Web, Android and others

Software Engineering

Web development

Disparate Opinions

Various tidbits

chsakell's Blog

Anything around ASP.NET MVC,WEB API, WCF, Entity Framework & AngularJS

Cyber Matters

Bite-size insight on Cyber Security for the not too technical.

Guru N Guns's

OneSolution To dOTnET.

Johnny Zraiby

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.

%d bloggers like this: