Architecture of a Big Data messaging and aggregation system using Amazon Web Services part 5


In the previous post we extended our Amazon Big Data design with Amazon RedShift. In this post we’ll look at some data storage options for the aggregated data. We’ll also look at one way to keep this flow running, i.e. how to make sure that aggregation happens automatically at regular intervals.

Storing aggregated data

Say that the aggregation mechanism has finished the aggregations that are interesting to your business. The aggregated data must also be stored somewhere that’s accessible to other systems. By “other systems” I mean applications where the end users can view the aggregations and run other queries, like a web site with lots of nice graphs. The input data for those graphs should be easily accessible from a data store.

For Elastic MapReduce you have 2 options at the time of writing this post. You can either export the aggregated data back to S3 or to DynamoDb. We’ve seen S3 before in this series but DynamoDb is new.

DynamoDb is Amazon’s cloud-based NoSql database. If you’ve worked with databases like MongoDb or RavenDb before then you’ll see that DynamoDb is similar. There are no schemas, so you can store any type of unstructured data in DynamoDb documents. I’d say that DynamoDb provides better structuring than S3, and the usual CRUD mechanisms are better supported as well. If you’d like to save your aggregated data from EMR somewhere else, like RedShift or Amazon Relational Database Service (RDS), then you’ll have to do it indirectly: save the data in S3 first and then export it to RedShift or RDS from S3 via some automation service like Amazon Import/Export or Amazon Data Pipeline – more on Data Pipeline below.
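To make the "no schemas" point a bit more concrete, here is a minimal Python sketch of how one aggregation result could be shaped before it is written to DynamoDb. DynamoDb's low-level format tags every value with its type ("S" for strings, "N" for numbers, which are transported as strings). The attribute names and the sample values are made up for illustration, and the actual write, e.g. a PutItem call through an AWS SDK, is left out.

```python
from decimal import Decimal


def to_dynamodb_item(aggregate):
    """Convert a flat dict of aggregation results into DynamoDb's
    attribute-value format, where every value is tagged with its type."""
    item = {}
    for name, value in aggregate.items():
        if isinstance(value, bool):
            item[name] = {"BOOL": value}
        elif isinstance(value, (int, float, Decimal)):
            # DynamoDb transports numbers as strings under the "N" tag.
            item[name] = {"N": str(value)}
        else:
            item[name] = {"S": str(value)}
    return item


# Hypothetical aggregation result: the average response time of one page
# over one 15-minute interval.
aggregate = {
    "url": "/index.html",
    "interval_start": "2014-08-01T10:15:00Z",
    "avg_response_ms": 245.7,
    "sample_count": 1280,
}
item = to_dynamodb_item(aggregate)
```

Since there is no schema, a later aggregation could add further attributes to its items without any change to the table itself.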

If you use RedShift as the aggregation mechanism then RedShift provides excellent PostgreSQL-based tabular storage as well so there’s probably no need to look any further. You can probably export the aggregation results to DynamoDb or S3 but I don’t really see the point. RedShift tables are easily accessible for a wide range of technologies: .NET, Java, Python etc.

Let’s extend our architecture diagram with the storage mechanisms:

Amazon Big Data Diagram with aggregation result storage


Automation

By automation I mean the automation of the aggregation jobs. You’ll probably want the aggregation job to run at defined intervals, say every 15 minutes. At the same time you might want to start an ad-hoc aggregation job outside the automation interval. One Amazon-based option is the following setup:

  • Build a “normal” Java application that goes through the aggregation process by calling the aggregation mechanism – EMR or RedShift – to run one or more aggregation scripts
  • Compile the Java app into a JAR file
  • Save the JAR file in S3
  • Let the JAR file be executed by another Amazon service called Data Pipeline using a shell script (.sh) which is also stored in S3

AWS Data Pipeline (DP) is an automation tool that can run a variety of job types – or activities as they are called in DP. DP can execute jobs at intervals or just once, log the result, re-try failed jobs and much more. If you decide to try this solution then ShellCommandActivity is the activity type you’re looking for. I won’t go into the details of setting up ShellCommandActivity here as this blog post is entirely dedicated to that.
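As a rough illustration of the setup described above, the following Python sketch builds a pipeline definition with a schedule, an EC2 worker and a ShellCommandActivity that runs a shell script stored in S3. The field names follow the Data Pipeline definition format as I remember it and should be checked against the official documentation. The object ids, the bucket path, the instance type and the timeout are all hypothetical, and a real definition would also need things like IAM roles and a log location.

```python
import json


def build_pipeline_definition(script_uri, period="15 minutes"):
    """Return a pipeline definition that runs the script at script_uri
    once per period on a short-lived EC2 instance."""
    return {
        "objects": [
            {
                # How often the activity should run.
                "id": "AggregationSchedule",
                "name": "AggregationSchedule",
                "type": "Schedule",
                "period": period,
                "startAt": "FIRST_ACTIVATION_DATE_TIME",
            },
            {
                # The machine that executes the shell script.
                "id": "AggregationWorker",
                "name": "AggregationWorker",
                "type": "Ec2Resource",
                "instanceType": "t1.micro",  # assumed instance type
                "terminateAfter": "10 minutes",  # assumed timeout
                "schedule": {"ref": "AggregationSchedule"},
            },
            {
                # The script itself would download the JAR from S3
                # and start it with "java -jar".
                "id": "RunAggregationJar",
                "name": "RunAggregationJar",
                "type": "ShellCommandActivity",
                "scriptUri": script_uri,
                "runsOn": {"ref": "AggregationWorker"},
                "schedule": {"ref": "AggregationSchedule"},
            },
        ]
    }


# Hypothetical S3 location of the shell script.
definition = build_pipeline_definition("s3://my-bucket/scripts/run-aggregation.sh")
print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of document that can be handed to Data Pipeline when the pipeline is created. An ad-hoc run outside the 15-minute interval could then be started by executing the same JAR manually.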

Let’s add DP to our diagram:

Amazon Big Data Diagram with Amazon Data Pipeline


We’re actually done with the core of our Big Data system. However, it can be extended in numerous ways. Here are some examples:

  • Accessing Kinesis requires that you provide your security credentials in the Kinesis producer, i.e. the application that sends the raw data messages to Kinesis. E.g. if you’re collecting the response times from an HTML page then the underlying JS file will need to include your credentials, which makes your Amazon account very vulnerable. One way to alleviate the problem is to set up a public endpoint in front of Kinesis, like a web service. This service can then itself forward the message to Kinesis. Another option is to set up temporary credentials for the producers. This page describes how to do that with the AWS Security Token Service.
  • Amazon has an efficient in-memory caching solution called ElastiCache. The aggregation mechanism could potentially save the aggregated data in the data store and also push it to the cache. The consuming application will then first consult the cache instead of the database to ease the load on the database.
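The caching idea in the last bullet point can be sketched in a few lines of Python. Plain dictionaries stand in for both ElastiCache and the data store here, and the class, function and key names are invented for the example, so treat it as an outline of the flow rather than a real client.

```python
class AggregationReader:
    """Reads aggregated values, consulting the cache before the store."""

    def __init__(self, cache, store):
        self.cache = cache
        self.store = store
        self.store_reads = 0  # counts how often the store had to be hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fall back to the data store and remember the result.
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


def publish_aggregate(cache, store, key, value):
    """Called by the aggregation job: save to the store and push to the cache."""
    store[key] = value
    cache[key] = value


cache = {}
store = {"/about.html|10:00": 300.0}  # a value saved before the cache existed

publish_aggregate(cache, store, "/index.html|10:15", 245.7)
reader = AggregationReader(cache, store)

fresh = reader.get("/index.html|10:15")  # served from the cache
older = reader.get("/about.html|10:00")  # first read goes to the store
again = reader.get("/about.html|10:00")  # second read is served from the cache
```

With a real cache you would also give each entry an expiry time so that stale aggregations do not live forever.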

I have another blog series dedicated to Big Data with .NET where I go through some of these components in greater detail, with a lot of code examples.

This post concludes this series. I hope you’ve learnt a lot of good stuff.

View all posts related to Amazon Web Services here.


About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.
