Introduction to CouchDB with .NET part 5: concurrency and eventual consistency

Introduction

In the previous post we continued our discussion of the CouchDB HTTP API. In particular we looked at the endpoints that modify databases and documents. We created a new database and inserted some documents. We also saw how to update documents. An interesting feature of CouchDB is that it keeps the old versions of an updated document until the database is compacted. Deleting a document marks it with the “_deleted” flag set to true. We can still read the older versions of the document and restore it to what it was before. Again, this can only be done until the database has been compacted. DB compaction generally removes old versions of a document and those documents that were marked with the “_deleted” flag. These versions are called revisions and we saw the role of revision IDs.

In this post we’ll briefly discuss how concurrency is handled in CouchDB.

Concurrency

By concurrency in a database we mean that multiple threads simultaneously try to modify and read the same data record, i.e. a document in our case. A user is updating a Person document with a new name and another user is trying to read the same document. How should this situation be handled? We cannot serve up a partially modified document where, say, the first name of the Person document was updated but the last name is still in the process of being updated.

Databases implement concurrency control in different ways. One solution is to lock the record entirely while it is being updated. The thread that’s trying to read the record needs to wait until the update has been completed. Also, if the database is clustered then the master node replicates the updated record(s) to all the slaves. All that must happen before the updated data record is available for reads. This is called strong consistency (SC). The main advantage of SC is that clients of the database always get the most recent version of a record, i.e. they never see any stale data. SC is important in scenarios where it’s crucial to see the most recent copy of the data. Examples include banking and financial applications, user profiles, payment-related processes and the like. The downside of SC is that the database will show a lower degree of availability with more users due to the database locks.

Another solution is not to lock the record while it is updated. Instead the caller gets the most recent available copy of the requested data record. Inevitably the user will get an outdated version of the document from time to time. This will be the case if a thread starts updating a document and another thread wants to read the document before the update has been completed. However, if the caller repeats the read operation after the update has been finalised then they will get the latest copy. This solution is called eventual consistency (EC). We have a similar situation in a database cluster where the master node completes the update and propagates the changes to the slaves. If the read request of the caller hits one of the slaves that has not yet received the updated copy then they will get an outdated version. However, the data will eventually be consistent across all nodes, hence the name eventual consistency. The main advantage of EC is high availability since there are no data locks. More threads will be served and get a response. The main disadvantage is that some of the responses may be outdated. However, in most cases we’re only talking about milliseconds since updates and data replication between nodes are usually fast. So don’t be put off by EC serving up old records: it typically won’t take minutes between update initiation and completion. A typical application for EC is comments on an article or blog. Probably no-one will care if an old copy of a comment is shown on a page for some milliseconds before it is updated in the view.
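The stale-read window described above can be illustrated with a small sketch. This is a toy in-memory model, not how CouchDB or any real cluster replicates data: the Cluster class, its method names and the "comment-1" key are all made up for the illustration, and "primary" and "replicas" correspond to the master node and the slaves in the text.

```python
class Cluster:
    """Toy model: one primary node and several replicas, all plain dicts."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Copy every pending write to every replica, in order.
        for key, value in self._pending:
            for replica in self.replicas:
                replica[key] = value
        self._pending.clear()

    def read_from_replica(self, index, key):
        # No locking: return whatever this replica currently holds.
        return self.replicas[index].get(key)


cluster = Cluster()
cluster.write("comment-1", "First draft")
cluster.replicate()

cluster.write("comment-1", "Edited text")
stale = cluster.read_from_replica(0, "comment-1")  # read before replication
cluster.replicate()
fresh = cluster.read_from_replica(0, "comment-1")  # read after replication
```

Here the first read returns the old comment text because the replica has not yet received the edit, while the second read returns the edited text. Under strong consistency the first read would instead have to wait until replication had finished.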

Concurrency in CouchDB

It turns out that CouchDB follows eventual consistency. The CouchDB architecture puts more importance on high availability than on data consistency. The technology by which it achieves EC is called Multi-Version Concurrency Control (MVCC). We’ve already seen MVCC in action in the previous post, although we didn’t call it by that name. Recall how the various versions of a document got different revision IDs. Each revision ID starts with an integer (1, 2, 3 etc.) followed by a hyphen and a unique identifier. If a document is updated, its data is saved in a totally new document version and gets a new revision ID. The older revisions stay intact, i.e. their data is not overwritten.

So let’s say the most recent revision of a document is 3 and an update request hits the database. We know that its revision number will be 4 after the update process has been completed. However, the database now receives a read request while the update is still executing. In SC the reading thread would be put on hold until the update thread has finished its task. In EC, and therefore in CouchDB, the most recent available copy will be served instead. In this example the caller will get revision 3.
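The revision 3 versus revision 4 scenario can be sketched with a toy in-memory document. This is only an illustration of the MVCC idea under simplifying assumptions: the VersionedDocument class is hypothetical, and the way the identifier after the hyphen is computed here (a truncated SHA-256 of the JSON body) is an assumption made for the sketch, not CouchDB's actual algorithm.

```python
import hashlib
import json


class VersionedDocument:
    """Toy in-memory model of one document under MVCC (not CouchDB's storage)."""

    def __init__(self, body):
        self.revisions = []  # list of (rev_id, body), oldest first
        self._commit(body)

    def _commit(self, body):
        # Every update is appended as a new revision; older ones stay intact.
        generation = len(self.revisions) + 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:32]  # illustrative only
        self.revisions.append((f"{generation}-{digest}", dict(body)))

    def read(self):
        # Readers never wait: they get the latest committed revision.
        return self.revisions[-1]

    def begin_update(self, new_body):
        # Return a function that commits the new revision when it is called.
        return lambda: self._commit(new_body)


doc = VersionedDocument({"first-name": "John", "age": 20})  # revision 1
doc.begin_update({"first-name": "John", "age": 24})()       # revision 2
doc.begin_update({"first-name": "John", "age": 25})()       # revision 3

pending = doc.begin_update({"first-name": "John", "age": 26})  # update in flight
rev_during, body_during = doc.read()  # served while the update is pending
pending()                             # the update completes
rev_after, body_after = doc.read()
```

The read issued while the update is still pending is served from revision 3, and only a read issued after the commit sees revision 4.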

MVCC is in action even when multiple threads try to update the same document. Let’s create a new database called persons like we saw in the previous post:

PUT http://localhost:5984/persons

Next we insert a document:

POST http://localhost:5984/persons

…and set the payload to the following:

{
	"first-name": "John",
	"last-name": "Smith",
	"age": 20
}

…and also set the Content-Type header to application/json. The CouchDB HTTP API will respond with an ID and a first revision number. In my case it looks like the following:

{
  "ok": true,
  "id": "3559d9c81c785b6bfc27a349040177b0",
  "rev": "1-366ee70a116d71908c345d542b828f4c"
}

I’ll now update this document and change John’s age:

PUT http://localhost:5984/persons/3559d9c81c785b6bfc27a349040177b0?rev=1-366ee70a116d71908c345d542b828f4c

{
	"first-name": "John",
	"last-name": "Smith",
	"age": 24
}

The revision number bumped up to 2:

{
  "ok": true,
  "id": "3559d9c81c785b6bfc27a349040177b0",
  "rev": "2-2efa1aec4234e8a977ff640d1d099b3f"
}
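For readers who prefer to script these calls, here is a minimal sketch of how the update request above could be assembled with Python's standard library. The helper name build_update_request is made up for this example, and the request is only built, not sent, so the snippet runs without a CouchDB server.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_update_request(base_url, database, doc_id, rev, document):
    """Build (but do not send) a PUT request that updates doc_id at revision rev."""
    query = urlencode({"rev": rev})
    url = f"{base_url}/{quote(database)}/{quote(doc_id)}?{query}"
    return Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_update_request(
    "http://localhost:5984",
    "persons",
    "3559d9c81c785b6bfc27a349040177b0",
    "1-366ee70a116d71908c345d542b828f4c",
    {"first-name": "John", "last-name": "Smith", "age": 24},
)
```

Passing the request to urllib.request.urlopen would send it, assuming a CouchDB instance is listening on localhost:5984. An error status such as 409 would then surface as a urllib.error.HTTPError.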

I’ll execute one more update on revision 2:

PUT http://localhost:5984/persons/3559d9c81c785b6bfc27a349040177b0?rev=2-2efa1aec4234e8a977ff640d1d099b3f

{
	"first-name": "John",
	"last-name": "W. Smith",
	"age": 24
}

We’re at revision 3 now:

{
  "ok": true,
  "id": "3559d9c81c785b6bfc27a349040177b0",
  "rev": "3-a11c420a45f2ca5334522e72aefb899e"
}

Now imagine that another thread also tried to update revision 2 at the same time. Try to execute the same request as above on revision 2 with a slightly different payload:

PUT http://localhost:5984/persons/3559d9c81c785b6bfc27a349040177b0?rev=2-2efa1aec4234e8a977ff640d1d099b3f

{
	"first-name": "John",
	"last-name": "Smith",
	"age": 25
}

We’ll get a different response this time: 409 Conflict with the following JSON response:

{
  "error": "conflict",
  "reason": "Document update conflict."
}

We’re not allowed to modify an old version like that. If the caller receives the above response then they’ll know that the document has been updated by someone else. The user will need to get hold of the latest revision.
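The conflict check and a typical client-side reaction to it can be sketched as follows. This is a simplified in-memory stand-in, not the CouchDB API: FakeStore, ConflictError and update_with_retry are hypothetical names, and the sketch compares plain revision counters where CouchDB compares full revision IDs.

```python
class ConflictError(Exception):
    """Stands in for CouchDB's 409 Conflict response."""


class FakeStore:
    """Hypothetical in-memory stand-in for a single CouchDB document."""

    def __init__(self, body):
        self.generation = 1
        self.body = dict(body)

    def get(self):
        # Return the current revision counter and a copy of the document.
        return self.generation, dict(self.body)

    def put(self, based_on_generation, body):
        # Reject the write unless it is based on the latest revision.
        if based_on_generation != self.generation:
            raise ConflictError("Document update conflict.")
        self.generation += 1
        self.body = dict(body)
        return self.generation


def update_with_retry(store, change, max_attempts=3):
    """Re-read the latest revision and reapply change after each conflict."""
    for _ in range(max_attempts):
        generation, body = store.get()
        change(body)
        try:
            return store.put(generation, body)
        except ConflictError:
            continue
    raise ConflictError("gave up after repeated conflicts")


store = FakeStore({"first-name": "John", "last-name": "Smith", "age": 20})
store.put(1, {"first-name": "John", "last-name": "Smith", "age": 24})
store.put(2, {"first-name": "John", "last-name": "W. Smith", "age": 24})

try:
    # A second writer still believes revision 2 is the latest one.
    store.put(2, {"first-name": "John", "last-name": "Smith", "age": 25})
    conflicted = False
except ConflictError:
    conflicted = True

new_generation = update_with_retry(store, lambda body: body.update(age=25))
```

The stale write is rejected, while the retry helper applies the age change on top of revision 3 and therefore keeps the “W. Smith” last name. For this to work the client has to get hold of the latest revision before every attempt.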

We saw one way of doing that in the previous post with the revs_info query parameter set to true:

GET http://localhost:5984/persons/3559d9c81c785b6bfc27a349040177b0?revs_info=true

{
  "_id": "3559d9c81c785b6bfc27a349040177b0",
  "_rev": "3-a11c420a45f2ca5334522e72aefb899e",
  "first-name": "John",
  "last-name": "W. Smith",
  "age": 24,
  "_revs_info": [
    {
      "rev": "3-a11c420a45f2ca5334522e72aefb899e",
      "status": "available"
    },
    {
      "rev": "2-2efa1aec4234e8a977ff640d1d099b3f",
      "status": "available"
    },
    {
      "rev": "1-366ee70a116d71908c345d542b828f4c",
      "status": "available"
    }
  ]
}

This includes the revision history and the latest revision. We can also leave out the flag and get only the latest revision of the document, including its revision ID:

GET http://localhost:5984/persons/3559d9c81c785b6bfc27a349040177b0

{
  "_id": "3559d9c81c785b6bfc27a349040177b0",
  "_rev": "3-a11c420a45f2ca5334522e72aefb899e",
  "first-name": "John",
  "last-name": "W. Smith",
  "age": 24
}

There’s another variant to get hold of the revision numbers using the “revs” flag as follows:

GET http://localhost:5984/persons/3559d9c81c785b6bfc27a349040177b0?revs=true

It responds with a different format than what we saw above:

{
  "_id": "3559d9c81c785b6bfc27a349040177b0",
  "_rev": "3-a11c420a45f2ca5334522e72aefb899e",
  "first-name": "John",
  "last-name": "W. Smith",
  "age": 24,
  "_revisions": {
    "start": 3,
    "ids": [
      "a11c420a45f2ca5334522e72aefb899e",
      "2efa1aec4234e8a977ff640d1d099b3f",
      "366ee70a116d71908c345d542b828f4c"
    ]
  }
}

Instead of the “_revs_info” property we have “_revisions”. It is a JSON object with a “start” property and an “ids” array. “start” indicates the current revision number, in this case 3. The ids array includes the revision IDs without the revision counter, starting with the most recent one. We read this array from top to bottom and count down from “start” to get the sequence of revisions:

3-a11c420a45f2ca5334522e72aefb899e
2-2efa1aec4234e8a977ff640d1d099b3f
1-366ee70a116d71908c345d542b828f4c
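The same reconstruction can be done in code. Here is a minimal sketch in Python that takes the “_revisions” object from the response above as input. The function name full_revision_ids is made up for this example.

```python
def full_revision_ids(revisions):
    """Rebuild 'N-id' revision IDs from a _revisions object, newest first."""
    start = revisions["start"]
    return [
        f"{start - offset}-{rev_id}"
        for offset, rev_id in enumerate(revisions["ids"])
    ]


revisions = {
    "start": 3,
    "ids": [
        "a11c420a45f2ca5334522e72aefb899e",
        "2efa1aec4234e8a977ff640d1d099b3f",
        "366ee70a116d71908c345d542b828f4c",
    ],
}

history = full_revision_ids(revisions)
```

The history list then holds the same three revision IDs as listed above, in the same order.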

The most important point about CouchDB concurrency is that it favours eventual consistency. The primary objective is to make CouchDB highly available at the expense of possible data consistency issues. Keep this in mind when designing your database.

We’ll continue in the next post.

You can view all posts related to data storage on this blog here.


About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.
