Introduction to CouchDB with .NET part 12: more MapReduce examples

Introduction

In the previous post we first saw how to insert and update design documents via the HTTP API. It is not very different from the equivalent operations on “normal” data documents. However, we need to consider the keywords in a design document such as “views”, “map” and “reduce”. We also saw how to select compound keys and values in the map function of the view index. Compound keys are very helpful when executing more complex queries such as “select all users above the age of 20 with an address in Washington”. We went through a number of examples about limiting the range of the result set using the startkey and endkey query parameters.

In this post we’ll continue where we left off previously and go through more MapReduce examples.

Counting and summing up the results

We’ve already seen the _count function in action but let’s look at it once again. Recall the following query from the previous post:

GET http://localhost:5984/children/_design/name/_view/age-gender?endkey=["m", 11]&startkey=["m", 0]

…where we want to select all boys up to the age of 11. In order to return the number of documents we need to add a reduce phase and use the _count function as follows:

"age-gender": {
      "map": "function (doc) {if (doc.age && doc.gender) { emit([doc.gender, doc.age], 'Full name: ' + doc.first_name + ' ' + doc.last_name); }}",
      "reduce": "_count"
    }

It returns 5 as the result since we have 5 documents that match the criteria:

{
  "rows": [
    {
      "key": null,
      "value": 5
    }
  ]
}

Recall that we can turn off the reduce phase by adding the “reduce=false” query parameter:

GET http://localhost:5984/children/_design/name/_view/age-gender?endkey=["m", 11]&startkey=["m", 0]&reduce=false

We’re back at the original output. This is good since we don’t need to have 2 separate views, one with a reduce function and another without it.
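The key parameters in these URLs are JSON values, so they are easy to get wrong when typed by hand. Here is a minimal sketch of building such a URL in JavaScript. The buildViewUrl helper is a hypothetical name made up for this illustration; the host, database, design document and view names are the ones used in this post.

```javascript
// Hypothetical helper: build a CouchDB view URL with JSON-encoded query parameters.
function buildViewUrl(baseUrl, params) {
  const url = new URL(baseUrl);
  for (const [name, value] of Object.entries(params)) {
    // View parameters such as startkey and endkey are passed as JSON text.
    // URLSearchParams takes care of percent-encoding the brackets and quotes.
    url.searchParams.set(name, JSON.stringify(value));
  }
  return url.toString();
}

const viewUrl = buildViewUrl(
  "http://localhost:5984/children/_design/name/_view/age-gender",
  { startkey: ["m", 0], endkey: ["m", 11], reduce: false }
);
console.log(viewUrl);
```

Dropping the reduce entry from the parameter object gives the URL for the reduced query, so the same view serves both purposes.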

What about the built-in reducer _sum? Let’s find out. We’ll first commit a deliberate error and simply update the age-gender view of the _design/name document as follows:

"age-gender": {
      "map": "function (doc) {if (doc.age && doc.gender) { emit([doc.gender, doc.age], 'Full name: ' + doc.first_name + ' ' + doc.last_name); }}",
      "reduce": "_sum"
    }

Execute…

GET http://localhost:5984/children/_design/name/_view/age-gender?endkey=["m", 11]&startkey=["m", 0]

…only to see an exception:

{
  "error": "invalid_value",
  "reason": "The _sum function requires that map values be numbers, arrays of numbers, or objects, not '<<\"Full name: William Hall\">>'. Objects cannot be mixed with other data structures. Objects can be arbitrarily nested, provided that the values for all fields are themselves numbers, arrays of numbers, or objects.",
  "ref": 604603324
}

Let’s extend the “name” design document with another view:

"age-gender-sum": {
      "map": "function (doc) {if (doc.age && doc.gender) { emit([doc.gender, doc.age], doc.age); }}",
      "reduce": "_sum"
    }

Now execute the GET request above with the start and end key filter. We’ll get the following result:

{
  "rows": [
    {
      "key": null,
      "value": 34
    }
  ]
}

34 is the sum of all of the values in the result set, i.e. all the ages. That’s not too useful, but at least we now know how the _sum function works.

The exception message above said that we can also have an array of numbers for the values. Let’s try that. Modify the map function of the age-gender-sum view as follows:

function(doc) {
    if (doc.age && doc.gender) {
        emit([doc.gender, doc.age], [doc.age, doc.age * 2]);
    }
}

We return the age and the double of the age in an array. Again, it’s not the most useful statistic but we primarily want to find out how the _sum function works. Running the same query URL as above returns the following:

{
  "rows": [
    {
      "key": null,
      "value": [
        34,
        68
      ]
    }
  ]
}

We got the element-wise sums, i.e. the sum of all the first elements and the sum of all the second elements of the arrays. This can be really useful when summing up multiple numeric values in one go.

Let’s also find out how summing up objects can work. Here’s the updated age-gender-sum index map function:

function(doc) {
    if (doc.age && doc.gender) {
        emit([doc.gender, doc.age], {"age": doc.age, "age-double": doc.age * 2});
    }
}

…i.e. it’s the same as above with the array but we return the age and its double in an object. Here’s the result of the reduce query:

{
  "rows": [
    {
      "key": null,
      "value": {
        "age": 34,
        "age-double": 68
      }
    }
  ]
}

I think that’s really cool.
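To make the three cases easier to compare, here is a rough JavaScript emulation of how values of the same shape are added up. This is only a sketch and not CouchDB’s implementation: the built-in reducers run inside CouchDB itself. The function names and the sample ages are made up for the illustration; the ages were simply chosen so that they add up to 34.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Add two values of the same shape: numbers, arrays of numbers, or objects.
function addValues(a, b) {
  if (typeof a === "number" && typeof b === "number") {
    return a + b;
  }
  if (Array.isArray(a) && Array.isArray(b) && a.length === b.length) {
    // Arrays are summed element by element.
    return a.map((item, i) => addValues(item, b[i]));
  }
  if (isPlainObject(a) && isPlainObject(b)) {
    // Objects are summed field by field.
    const result = Object.assign({}, a);
    for (const field of Object.keys(b)) {
      result[field] = field in result ? addValues(result[field], b[field]) : b[field];
    }
    return result;
  }
  throw new Error("Values must be numbers, arrays of numbers, or objects of the same shape");
}

function sumLike(values) {
  return values.reduce(addValues);
}

const sampleAges = [5, 5, 7, 8, 9]; // made-up ages, not the real documents

console.log(sumLike(sampleAges)); // 34
console.log(sumLike(sampleAges.map(age => [age, age * 2]))); // [ 34, 68 ]
console.log(sumLike(sampleAges.map(age => ({ "age": age, "age-double": age * 2 }))));
```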

Limiting the result set

We’ve seen the “limit” query parameter in action. It is similar to TOP [number] in SQL to build “best-of” or “top-x” lists. Here’s an example:

http://localhost:5984/children/_design/name/_view/age-gender-sum?endkey=["m", 11]&startkey=["m", 0]&reduce=false&limit=2

The rows are sorted by key, so this will return the first 2 rows of the filtered result set, i.e. the 2 youngest boys.

Grouping

We can achieve grouping in CouchDB through compound keys returned from the map function. We can use the children demo database to illustrate grouping. We group the kids by their gender and age. Note that the order of grouping matters: we either group by gender first and then by age or vice versa.

This is also true of more complex grouping keys. E.g. in a “cities” database a city can have a name, a postal code, a country and a continent. The same city can appear multiple times if it has several postal codes. In that case we can group by city, then country and continent, giving us a grouping key of 3 members.

Anyway, let’s return to our children database. Let’s first extend the views section of the _design/name design document:

"age-gender-group": {
      "map": "function(doc) {    if (doc.age && doc.gender) {        emit([doc.gender, doc.age], doc.first_name + ' ' + doc.last_name);    }}",
      "reduce": "_count"
    }

There’s nothing new here; we’ve seen all that before. The group query parameter must be set to true in order to perform the grouping:

GET http://localhost:5984/children/_design/name/_view/age-gender-group?group=true

The children database is quite small with little variation so the result set is not too exciting. We have at least some cases where the count is 2:

{
      "key": [
        "f",
        8
      ],
      "value": 2
    },
{
      "key": [
        "m",
        5
      ],
      "value": 2
    }

We have 2 girls of 8 years and 2 boys of 5 years.

We can specify the group level in the query. In our case we have 2 grouping keys, i.e. we can go 2 levels down in total. That’s the default behaviour if we don’t provide the grouping level, i.e. CouchDB will use all the provided grouping keys. The group_level parameter allows us to modify that behaviour. Consider the following request:

GET http://localhost:5984/children/_design/name/_view/age-gender-group?group=true&group_level=1

So we have two grouping keys: the gender and the age, in that order. Setting group_level equal to 1 means that we only want to use the first grouping key in our array of grouping keys. That will be gender in our case. The query returns the grouping by gender:

{
  "rows": [
    {
      "key": [
        "f"
      ],
      "value": 7
    },
    {
      "key": [
        "m"
      ],
      "value": 7
    }
  ]
}

We have 7 boys and 7 girls in the database. When group=true is set without a group_level, all grouping keys are taken into account. This level is called “exact”.
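The effect of group_level can be emulated in a few lines of JavaScript. The sketch below only illustrates the idea and says nothing about how CouchDB does it internally. The countByGroupLevel function and the sample rows are made up for this example.

```javascript
// Count rows per group, where a group is defined by the first
// groupLevel members of the compound key.
function countByGroupLevel(rows, groupLevel) {
  const groups = new Map();
  for (const row of rows) {
    const groupKey = row.key.slice(0, groupLevel);
    const id = JSON.stringify(groupKey);
    const entry = groups.get(id) || { key: groupKey, value: 0 };
    entry.value += 1; // same effect as the _count reducer
    groups.set(id, entry);
  }
  return Array.from(groups.values());
}

// Made-up rows in key order, shaped like the output of the map function above.
const sampleRows = [
  { key: ["f", 8], value: "Girl One" },
  { key: ["f", 8], value: "Girl Two" },
  { key: ["f", 9], value: "Girl Three" },
  { key: ["m", 5], value: "Boy One" },
  { key: ["m", 5], value: "Boy Two" },
  { key: ["m", 7], value: "Boy Three" }
];

console.log(JSON.stringify(countByGroupLevel(sampleRows, 2))); // by gender and age
console.log(JSON.stringify(countByGroupLevel(sampleRows, 1))); // by gender only
```

With level 2 the sample produces four groups, with level 1 only two: one for "f" and one for "m", each with a count of 3.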

Custom reducers

We can provide our own JS function for the reducer. The function must accept the keys and the values from the map phase. Optionally it also accepts a parameter called rereduce. In this demo we’ll see how to return the maximum age in the children database.

Add a new view to the “name” design document called “age-gender-reduce”. For the reducer select CUSTOM which will provide you with a basic implementation of a reducer:

[Screenshot: adding a custom reducer to a MapReduce function in the Fauxton UI of CouchDB]

Set the map function to the following:

function(doc) {
    if (doc.age && doc.gender) {
        emit(doc.first_name + ' ' + doc.last_name, doc.age);
    }
}

Here’s the state of the design document before saving:

[Screenshot: first version of the custom reduce function in the Fauxton UI of CouchDB]

Let’s see what the custom function looks like:

function (keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  } else {
    return values.length;
  }
}

The keys and values parameters are exactly the keys and values passed in from the map function. We can apply any valid JavaScript function on the keys and the values. However, what is this rereduce parameter? It’s quite mysterious to say the least. At first we only see from the code that it is a boolean. If it’s true then we return the sum of the values, otherwise we return the length of the values array, i.e. the count. Save the view and execute it:

GET http://localhost:5984/children/_design/name/_view/age-gender-reduce

It will return the count:

{
  "rows": [
    {
      "key": null,
      "value": 14
    }
  ]
}

The rereduce parameter refers to the reduce function being called multiple times on large data sets. In other words the reduce function can be called recursively using the result of the previous iteration. This guide provides a good explanation of the purpose of this parameter:

The reason for this is that – when a view contains a large number of rows – CouchDB uses a divide and conquer strategy to calculate reduce results more efficiently. It does this by breaking up the key/value pairs into smaller sets and running the reduce function on each of these smaller sets separately. Once this is done, it bundles all of the results into an array and runs this array through the reduce function again. This process can happen several times before the final result is produced. When the reduce function is run for the results of a previous reduce, the rereduce parameter is set to true so you can handle it properly.

Our children database is small so rereduce is unlikely to matter here and the reducer returns the count. However, it’s good to know about the purpose of this parameter as you may come across it in large CouchDB databases.
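We can mimic the divide and conquer strategy in plain JavaScript to see why the default reducer sums the values when rereduce is true. The following is only a sketch: the reduceInChunks helper, the chunk size and the sample rows are assumptions made for the illustration, and CouchDB decides on its own how to split up the rows. The sum function is provided by CouchDB inside view functions, so it is defined here only to make the example run on its own.

```javascript
// Stand-in for the sum() helper that CouchDB provides to view functions.
function sum(values) {
  return values.reduce((total, value) => total + value, 0);
}

// The default custom reducer shown above.
function countReducer(keys, values, rereduce) {
  if (rereduce) {
    return sum(values);
  } else {
    return values.length;
  }
}

// Hypothetical helper: reduce the rows in chunks, then reduce the partial results.
function reduceInChunks(reducer, rows, chunkSize) {
  const partials = [];
  for (let i = 0; i < rows.length; i += chunkSize) {
    const chunk = rows.slice(i, i + chunkSize);
    // First pass: keys are [key, document id] pairs and rereduce is false.
    const keys = chunk.map(row => [row.key, row.id]);
    const values = chunk.map(row => row.value);
    partials.push(reducer(keys, values, false));
  }
  // Second pass: keys are null and the values are the partial results.
  return reducer(null, partials, true);
}

// 14 made-up rows, matching the number of documents in the demo database.
const mappedRows = [];
for (let i = 1; i <= 14; i++) {
  mappedRows.push({ id: "doc" + i, key: "child " + i, value: i });
}

console.log(reduceInChunks(countReducer, mappedRows, 4)); // 14
```

The chunks give the partial counts 4, 4, 4 and 2. Summing them in the second pass gives 14, whereas returning values.length there would give 4, the number of chunks.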

The rereduce parameter is actually optional. We can update the custom reduce function to the following:

function (keys, values) {
  var max = 0;
  for (var i = 0; i < values.length; i++) {
    if (values[i] > max) {
      max = values[i];
    }
  }
  return max;
}

That’s some simple JavaScript to find the maximum value in the values array. Note the absence of the rereduce parameter: the maximum of partial maxima is still the overall maximum, so the same code works in both phases. Here’s the result:

{
  "rows": [
    {
      "key": null,
      "value": 13
    }
  ]
}

13 is indeed the correct answer.
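Ignoring rereduce is only safe when the reducer can be applied to its own output. The sketch below contrasts the max reducer with a count reducer that ignores the parameter. The reduceTwice helper, the chunk size and the ages are made up for the illustration, and the keys are passed as null in both passes to keep it short.

```javascript
// The max reducer from above.
function maxReducer(keys, values) {
  var max = 0;
  for (var i = 0; i < values.length; i++) {
    if (values[i] > max) {
      max = values[i];
    }
  }
  return max;
}

// For contrast: a count reducer that ignores rereduce.
function naiveCount(keys, values) {
  return values.length;
}

// Hypothetical helper: reduce in chunks first, then reduce the partial results.
function reduceTwice(reducer, values, chunkSize) {
  const partials = [];
  for (let i = 0; i < values.length; i += chunkSize) {
    partials.push(reducer(null, values.slice(i, i + chunkSize), false));
  }
  return reducer(null, partials, true);
}

const agesForMax = [5, 13, 8, 2, 11, 5, 8]; // made-up ages

console.log(reduceTwice(maxReducer, agesForMax, 3)); // 13, the correct maximum
console.log(reduceTwice(naiveCount, agesForMax, 3)); // 3, the number of chunks, not 7
```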

Use grouping to extract unique values

There’s a very peculiar application of grouping and custom reducers in CouchDB. Say that we want to collect all the unique ages in the database. There’s no built-in function to do that so we need to take a different approach. The first step is to have a view whose key will be grouped. We want to find the unique age values so we’ll need to group on the age field of the children documents. Second, we’ll need a dummy reducer that returns some default value like 1. We’ll soon see why. Add the following view to the _design/name design document:

"age-unique": {
      "map": "function(doc) {if (doc.age) {        emit(doc.age, doc.first_name + ' ' + doc.last_name);    }}",
      "reduce": "function(keys, values) {   return 1;}"
    }

Running the view…

GET http://localhost:5984/children/_design/name/_view/age-unique

…won’t return the most exciting data:

{
  "rows": [
    {
      "key": null,
      "value": 1
    }
  ]
}

However, we can now activate grouping like we did above:

GET http://localhost:5984/children/_design/name/_view/age-unique?group=true

…and there we go, it will return the unique keys, i.e. the unique ages in ascending order.
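The reason the trick works is that grouping produces exactly one row per distinct key, whatever the reducer returns. Here is a small JavaScript emulation of that idea. The groupedView function and the sample rows are made up for the illustration, and the keys are assumed to be plain numbers.

```javascript
// The dummy reducer from the age-unique view.
function dummyReducer(keys, values) {
  return 1;
}

// Emulate group=true: sort the rows by key, then reduce each run of equal keys.
function groupedView(rows, reducer) {
  const sorted = rows.slice().sort((a, b) => a.key - b.key);
  const result = [];
  let start = 0;
  while (start < sorted.length) {
    let end = start;
    while (end < sorted.length && sorted[end].key === sorted[start].key) {
      end++;
    }
    const group = sorted.slice(start, end);
    result.push({
      key: sorted[start].key,
      value: reducer(group.map(row => [row.key, row.id]), group.map(row => row.value))
    });
    start = end;
  }
  return result;
}

// Made-up rows shaped like the output of the age-unique map function.
const ageRows = [
  { id: "a", key: 8, value: "Child A" },
  { id: "b", key: 5, value: "Child B" },
  { id: "c", key: 13, value: "Child C" },
  { id: "d", key: 8, value: "Child D" },
  { id: "e", key: 5, value: "Child E" },
  { id: "f", key: 2, value: "Child F" }
];

const uniqueAges = groupedView(ageRows, dummyReducer).map(row => row.key);
console.log(uniqueAges); // [ 2, 5, 8, 13 ]
```

Swapping the dummy reducer for a counting one would keep the same keys and additionally tell us how many children share each age.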

Read the next part here.

You can view all posts related to data storage on this blog here.

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.