Using Amazon DynamoDb with the AWS .NET API Part 7: its place in Big Data

Introduction

In the previous post we looked at how to query our test table in Amazon DynamoDb. That post also completed our main discussion on how to use DynamoDb in .NET.

This post is dedicated to the larger picture: Big Data. This series is part of a larger series on cloud-based Big Data infrastructure, along with the message handler Amazon Kinesis and the raw data storage S3. Therefore the prerequisite for following the discussion in this post is familiarity with what we discussed in those posts, or at least some knowledge of Big Data. However, I’ll try to write in a way that those of you who’ve only come here for DynamoDb can get a taste of the role it can play in a cloud-based Big Data architecture.

DynamoDb in Big Data

As DynamoDb is a storage mechanism it can potentially be used in any storage-related scenario within a Big Data architecture. One such example is raw data storage, i.e. a place to store the incoming raw data messages from the devices that send their data to your system.

However, DynamoDb was not exactly designed to handle a large influx of messages within a short period of time. You’d need to increase the write throughput to a very large number, otherwise you’ll start getting exceptions because the specified throughput limit has been exceeded. As a result some of your messages will be lost and your Amazon bill will be higher. We’ve already identified S3 as our raw data storage anyway, so we won’t change that strategy.
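To make the throughput issue a bit more concrete, here’s a minimal sketch in C# – the table name, item layout and retry counts are made up for this example – of what a write against an under-provisioned table looks like with the AWS .NET SDK, and how you might back off and retry when the throughput limit is exceeded:

using System;
using System.Collections.Generic;
using System.Threading;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public class RawDataWriter
{
    // Hypothetical table used for illustration only
    private const string TableName = "RawDataMessages";

    public void WriteMessage(IAmazonDynamoDB client, string deviceId, string payload)
    {
        var request = new PutItemRequest
        {
            TableName = TableName,
            Item = new Dictionary<string, AttributeValue>
            {
                { "DeviceId", new AttributeValue { S = deviceId } },
                { "Timestamp", new AttributeValue { S = DateTime.UtcNow.ToString("o") } },
                { "Payload", new AttributeValue { S = payload } }
            }
        };

        // If the incoming message rate exceeds the provisioned write throughput,
        // the service starts rejecting requests with ProvisionedThroughputExceededException.
        // Backing off and retrying helps a little, but messages that still fail are lost
        // unless you buffer them somewhere else - hence S3 as the raw data store.
        for (var attempt = 1; attempt <= 3; attempt++)
        {
            try
            {
                client.PutItem(request);
                return;
            }
            catch (ProvisionedThroughputExceededException)
            {
                Thread.Sleep(TimeSpan.FromMilliseconds(200 * attempt));
            }
        }
    }
}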

Data aggregations

Instead, DynamoDb can be used in another way, namely for storing data aggregations. Normally you don’t just want to sit on your source data but do something with it. This “do something” bit usually involves some data mining and aggregations. In data analysis apps your customers will be after the answers to complex aggregation queries such as these (a small DynamoDb-flavoured sketch follows the list):

  • What was the maximum response time of /Products.aspx on my web page for users in Seattle using IE 10 on January 12 2015?
  • What were the average sales of product ‘ABC’ in our physical shops in the North-America region between 01 and 20 December 2014?
  • What is the total value of product XYZ sold on our web shop from iOS mobile apps in France in February 2015 after running a campaign?
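To give a feel for how DynamoDb can serve such answers, here’s a minimal sketch – the table, key and attribute names are all invented for this example – of reading back a pre-computed record for the first question above with the .NET SDK. A batch job would have stored the maximum response time per page, city and browser for each day, so the web layer only needs a cheap key lookup:

using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public class AggregationReader
{
    // Hypothetical table with hash key "MetricKey" and range key "Date"
    public QueryResponse GetMaxResponseTime(IAmazonDynamoDB client)
    {
        var request = new QueryRequest
        {
            TableName = "PageResponseTimeAggregations",
            KeyConditionExpression = "MetricKey = :metricKey and #d = :date",
            ExpressionAttributeNames = new Dictionary<string, string>
            {
                { "#d", "Date" } // alias the attribute name to stay clear of reserved words
            },
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                { ":metricKey", new AttributeValue { S = "/Products.aspx#Seattle#IE10" } },
                { ":date", new AttributeValue { S = "2015-01-12" } }
            }
        };

        // The response holds the pre-aggregated item, e.g. a "MaxResponseTimeMs" attribute
        return client.Query(request);
    }
}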

When the data aggregation is complete, the aggregated data can be stored in a number of the cloud-based storage services that Amazon has to offer:

  • S3, which we saw earlier on this blog
  • DynamoDb
  • Relational Database Service (RDS), a traditional relational database service where you can choose from a number of DBMSs like MySQL, SQL Server, PostgreSQL etc.
  • Amazon RedShift, a columnar data store especially suited for data mining and aggregations – we’ll take it up in a later series

So where does DynamoDb fit in? It will largely depend on which aggregation mechanism you go for. Amazon has at least two services especially made for data mining and analysis: RedShift and Elastic MapReduce (EMR). With Elastic MapReduce you have two options at the time of writing this post: you can export the aggregated data either back to S3 or to DynamoDb. This is where DynamoDb fits into the Amazon Big Data infrastructure. We’ll take up EMR in the next series.

We’ll prepare the necessary DynamoDb table in the next series so that the aggregation result can be stored there.
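As a small preview – the table name, keys and throughput values below are placeholders, the real table will be defined in that series – creating such a target table with the .NET SDK works just like the table creation we saw earlier in this series:

using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public class AggregationTableBuilder
{
    public void CreateAggregationTable(IAmazonDynamoDB client)
    {
        var request = new CreateTableRequest
        {
            TableName = "AggregationResults", // placeholder name
            AttributeDefinitions = new List<AttributeDefinition>
            {
                new AttributeDefinition { AttributeName = "Id", AttributeType = ScalarAttributeType.S },
                new AttributeDefinition { AttributeName = "Window", AttributeType = ScalarAttributeType.S }
            },
            KeySchema = new List<KeySchemaElement>
            {
                new KeySchemaElement { AttributeName = "Id", KeyType = KeyType.HASH },     // hash key
                new KeySchemaElement { AttributeName = "Window", KeyType = KeyType.RANGE } // range key
            },
            // Once EMR has written the aggregated values the table is mostly read from,
            // so the write throughput can stay relatively modest.
            ProvisionedThroughput = new ProvisionedThroughput { ReadCapacityUnits = 10, WriteCapacityUnits = 5 }
        };

        client.CreateTable(request);
    }
}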

What are these tools like?

I briefly introduced these aggregation mechanisms in another series on this blog, so I’ll copy the relevant sections here. We’ll take up both EMR and RedShift in more detail later on but I’d like to give you a preview here.

Aggregation mechanism option 1: Elastic MapReduce

Elastic MapReduce (EMR) is Amazon’s web service with the Hadoop framework installed. What’s MapReduce? I’ve introduced MapReduce in an unrelated post elsewhere on this blog.

Hadoop is widely used in the distributed computing world where large amounts of potentially unstructured input files are stored across several servers. The computations are also distributed across the available nodes in the cluster. The final goal is of course to process these raw data files in some way, like what we’re trying to do in this series. If you’re planning to work with Big Data and data mining then you’ll definitely come across Hadoop at some point. Hadoop is a large system with a steep learning curve.

Also, there’s a whole ecosystem built on top of Hadoop to simplify writing MapReduce functions and queries. There are even companies out there, like Hortonworks, who specialise in GUI tools around Hadoop so that you don’t have to write plain queries in a command line tool.

Some of the technologies around Hadoop are the following:

  • Java MapReduce: you can write MapReduce applications with Java. These are quite low-level operations and probably not too many Big Data developers go with this option in practice but I might be wrong here
  • Hive: an SQL-like language that creates Java MapReduce functions under the hood and runs them. This is a much less complex solution than writing a Java MapReduce function
  • Pig: Pig – or Pig Latin – is another language/technology to handle large amounts of data in a fast, parallel and efficient way

…and a whole lot more but Hive and Pig seem to be the most popular ones at the time of writing this post.

Hadoop is normally installed on Linux machines although it can be deployed on Windows with some tricks – and luck :-). However, on production systems you’ll probably not want to run Hadoop on Windows. Therefore if you start a new instance of EMR in Amazon you’ll be given one or more Linux EC2 machines with Hadoop and a couple of other tools installed. E.g. Hive and Pig will follow along, although you can opt out of those when you start a new Hadoop instance. It’s good to have them installed so that you can test your Hive/Pig scripts on them.

Aggregation mechanism option 2: Amazon RedShift

Amazon RedShift is Amazon’s data warehousing solution. It follows a columnar DBMS architecture and it was designed especially for heavy data mining requests.

RedShift is based on PostgreSQL with some Amazon-specific additions, e.g. for importing raw data values from an S3 bucket. PostgreSQL syntax is very similar to other SQL dialects you may be familiar with, such as MS SQL. Therefore learning the basics of the language doesn’t require a lot of time and effort and you can become productive quite fast.
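Since RedShift speaks the PostgreSQL wire protocol you can also reach it from .NET through a standard PostgreSQL ADO.NET provider such as Npgsql. The snippet below is only a sketch – the cluster endpoint, credentials and the sales table are all made up – but it shows that querying RedShift feels much like querying any other SQL database:

using System;
using Npgsql;

public class RedShiftQueryRunner
{
    public void PrintAverageSales()
    {
        // Hypothetical cluster endpoint, database and credentials
        var connectionString =
            "Host=mycluster.abc123.eu-west-1.redshift.amazonaws.com;Port=5439;" +
            "Database=sales;Username=masteruser;Password=secret";

        using (var connection = new NpgsqlConnection(connectionString))
        {
            connection.Open();

            // A set-based aggregation over a hypothetical sales table
            var sql = @"SELECT region, AVG(amount) AS avg_sales
                        FROM shop_sales
                        WHERE product_code = 'ABC'
                          AND sale_date BETWEEN '2014-12-01' AND '2014-12-20'
                        GROUP BY region;";

            using (var command = new NpgsqlCommand(sql, connection))
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1}", reader.GetString(0), reader[1]);
                }
            }
        }
    }
}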

Having said all that, here comes a warning. The version of PostgreSQL employed on RedShift has some serious limitations compared to the full-blown PostgreSQL. E.g. stored procedures, triggers, functions, auto-incrementing primary keys, enforced secondary keys etc. are NOT SUPPORTED. RedShift is still optimised for aggregation functions but it’s a good idea to be aware of the limits. This page has links to the lists of all missing or limited features. You’ll need to be prepared to write set-based SQL logic instead of the procedural SQL statements that many programmers would otherwise reach for. This page explains very well with examples where the differences lie.

We’ll see examples of Postgresql on RedShift in an upcoming series.

RedShift vs. Elastic MapReduce

You may be asking which of the two aggregation mechanisms is faster, EMR with Hive or RedShift with PostgreSQL. According to the tests we’ve performed in our project, RedShift aggregation jobs run faster. A lot faster. RedShift can also be used as the storage device for the aggregated data. Other platforms like web services or desktop apps can easily pull the data from a RedShift table. You can also store the aggregated data on EMR in the Hadoop file system, but that data is not as readily available to other external platforms. We’ll look at some more storage options in the next part of this series so I won’t give any more details here.

This doesn’t mean that EMR is completely out of the game, but if you’re facing a scenario such as the one described in this series then you’re probably better off using RedShift for the aggregation purposes. However, if you’re building a UI where the client can put together completely flexible queries then Hadoop is probably a better option. RedShift is better suited to predefined aggregations – these aggregations can be quite granular but still finite. E.g. you might only support the following time-based aggregations: 15 mins, 30 mins, 1 hr, 6 hrs, 24 hrs, in which case RedShift-based aggregation jobs will be ideal. However, if the user can freely select the aggregation period along with other parameters then Hadoop, possibly with Pig, is a better fit.

View all posts related to Amazon Web Services and Big Data here.



6 Responses to Using Amazon DynamoDb with the AWS .NET API Part 7: its place in Big Data

  1. John Trimming says:

    Andras, do you know how common it would be to use Entity Framework approaches with a NoSQL database such as DynamoDB? There are ADO.Net providers written for it but I don’t know how widely these are used or whether it’s a good idea to use them…?

    • Andras Nemes says:

      John, I haven’t worked with any AWS-based projects other than what we have at the company and we don’t use any special provider. I usually trust the AWS libraries so I haven’t even considered looking into third-party providers. //Andras

  2. John Trimming says:

    One provider is provided by RSSBUS: http://cdn.rssbus.com/help/DD1/ado/pg_ef6.htm

  3. Marcelo says:

    Your articles are amazing and helped me a lot. Thanks from Brazil.

