Architecture of a Big Data messaging and aggregation system using Amazon Web Services part 2

Introduction

In the previous post of this series we outlined the system we’d like to build. We also decided to go for Amazon Kinesis as the message handler and established that we’ll need an application to consume the messages from the Kinesis stream.

In this post we’ll continue to build our design diagram and discuss where to store the raw data coming from Kinesis.

Storing the raw messages: Amazon S3

So now our Kinesis application, whether it’s written in Java, Python, .NET or anything else, is continuously receiving messages from the stream. It will probably perform all or some of the following tasks (see the sketch after this list):

  • Validate all incoming messages
  • Filter out the invalid ones: they can be ignored completely or logged in a special way for visual inspection
  • Perform some kind of sorting or grouping based on some message parameter: customer ID, UTC date or something similar
  • Finally store the raw data in some data store
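Here’s a minimal sketch of such a processing pipeline in Java. The RawDataPoint class, the isValid check and the RawDataStore interface are hypothetical placeholders rather than anything from the AWS SDK; the point is only to illustrate the validate, filter, group and store flow:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical representation of one raw data point pulled from the stream.
class RawDataPoint {
    String customerId;
    long dateUnixMs;
    String activity;
    long durationMs;

    boolean isValid() {
        // Minimal sanity check; real validation rules depend on your business.
        return customerId != null && !customerId.isEmpty() && dateUnixMs > 0;
    }
}

// Hypothetical abstraction over the raw data store (S3 in our case).
interface RawDataStore {
    void store(String groupKey, List<RawDataPoint> points);
}

class RawMessageProcessor {
    private final RawDataStore store;

    RawMessageProcessor(RawDataStore store) {
        this.store = store;
    }

    void process(List<RawDataPoint> batch) {
        // Filter out the invalid messages; they could also be logged for inspection.
        Map<String, List<RawDataPoint>> grouped = batch.stream()
                .filter(RawDataPoint::isValid)
                // Group by some message parameter, here the customer ID.
                .collect(Collectors.groupingBy(p -> p.customerId));

        // Finally store each group of raw data points.
        grouped.forEach(store::store);
    }
}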

This “some data store” needs to provide fast, cheap, durable and scalable storage. We assume that the data store will need to handle a lot of constant write operations but read operations may not be as frequent. The raw data will be extracted later on during the analysis phase which occurs periodically, say every 15-30 minutes, but not all the time. Aggregating and analysing raw data every 15 minutes will suit many “real time” scenarios.

However, we might need to go in and find particular raw data points at a later stage. We may need to debug the system and visually check what the data points look like. Also, some aggregated data may look suspicious in some way, so we have to verify its consistency against the raw data.

One possible solution that fits all these requirements is Amazon S3. It’s a blob storage where you can store any type of file in buckets: images, text, HTML pages, multimedia files, anything. All files are stored by a key which is typically the full file name.

The files can be organised in buckets. Each bucket can have one or more subfolders which in turn can have their own subfolders and so on. Note that these folders are not the same type of folder you’d have on Windows; under the hood they are just key prefixes, a visual way to organise your files. Here’s an example where the top bucket is called “eux-scripts” and has 4 subfolders:

S3 bucket subfolder example

Here’s an example of .jar and .sh files stored in S3:

S3 files example

Keep in mind that S3 is not designed for updates though. Once you’ve uploaded a file to S3 it cannot be modified in place; even for a text file there’s no editor. To change a file you have to upload a replacement: either overwrite the object under the same key or delete the old file and upload a new one.

Storing our raw data

So how can S3 be used to store our raw data? Suppose that we receive the raw data formatted as JSON:

{
    "CustomerId": "abc123",
    "DateUnixMs": 1416603010000,
    "Activity": "buy",
    "DurationMs": 43253
}

You can store the raw data in text files in S3 in the format you wish. The format will most likely depend on the mechanism that will eventually pull data from S3. Amazon data storage and analysis solutions such as RedShift or Elastic MapReduce (EMR) have been designed to read data from S3 efficiently – we’ll discuss both in this series. So at this stage you’ll need to do some forward thinking:

  • A: What mechanism will need to read from the raw data store for aggregation?
  • B: How can we easily – or relatively easily – read the raw data visually by just opening a raw data file?

For B you might want to store the raw data as it is, i.e. as JSON. E.g. you can have a text file in S3 called “raw-data-customer-abc123-2014-11-10.txt” with the following data points:

{"CustomerId": "abc123", "DateUnixMs": 1416603010000, "Activity": "buy", "DurationMs": 43253}
{"CustomerId": "abc123", "DateUnixMs": 1416603020000, "Activity": "buy", "DurationMs": 53253}
{"CustomerId": "abc123", "DateUnixMs": 1416603030000, "Activity": "buy", "DurationMs": 63253}
{"CustomerId": "abc123", "DateUnixMs": 1416603040000, "Activity": "buy", "DurationMs": 73253}

…i.e. with one data point per line.

However, this format is not suitable for point A above. Other mechanisms will have a hard time reading it efficiently. For RedShift and EMR to work most efficiently we’ll need to store the raw data in a delimited format such as CSV or tab-separated fields. The above data points can then be stored as follows in a CSV file:

abc123,1416603010000,buy,43253
abc123,1416603020000,buy,53253
abc123,1416603030000,buy,63253
abc123,1416603040000,buy,73253

This is probably OK for point B above as well. It’s not too hard on your eyes to understand this data structure so we’ll settle for that.
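As a rough illustration, here’s how the consumer application could turn an incoming JSON data point into such a CSV line using the Jackson library. The field names match the sample JSON above; the conversion is deliberately naive and doesn’t handle missing fields or commas inside values:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToCsvConverter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Converts one raw JSON data point into a single CSV line.
    public static String toCsvLine(String json) throws Exception {
        JsonNode node = MAPPER.readTree(json);
        return String.join(",",
                node.get("CustomerId").asText(),
                node.get("DateUnixMs").asText(),
                node.get("Activity").asText(),
                node.get("DurationMs").asText());
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"CustomerId\": \"abc123\", \"DateUnixMs\": 1416603010000, "
                + "\"Activity\": \"buy\", \"DurationMs\": 43253}";
        // Prints: abc123,1416603010000,buy,43253
        System.out.println(toCsvLine(json));
    }
}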

File organisation refinements

We now know a little about S3 and have decided to go for a delimited storage format. The next question is how we actually organise our raw data files into buckets and folders. Again, we’ll need to consider points A and B above. In addition you’ll need to consider the frequency of your data aggregations: once a day, every hour, every quarter of an hour?

You might first go for a customer ID – or some other ID – based grouping, so e.g. you’ll have a top bucket and sub folders for each customer:

  • Top bucket: “raw-data-points”
  • Subfolder: “customer123”
  • Subfolder: “customer456”
  • Subfolder: “customer789”
  • …etc.

…and within each subfolder you can have subfolders based on dates in the raw data points, e.g.:

  • Sub-subfolder: “2014-10-11”
  • Sub-subfolder: “2014-10-12”
  • Sub-subfolder: “2014-10-13”
  • Sub-subfolder: “2014-10-14”
  • …etc.

That looks very nice and it probably satisfies question B above, but not so much question A. This structure is difficult for an aggregation mechanism to handle, as you’ll need to provide complex search criteria for the aggregation. In addition, suppose you want to aggregate the data every 30 minutes and you dump all raw data points into one of those date sub-subfolders: you’ll again need to set up complex search criteria for the aggregation mechanism to extract just the right raw data points.

One possible solution is the following:

  • Decide on the minimum aggregation frequency you’d like to support in your system – let’s take 30 minutes for the sake of this discussion
  • Have one dedicated top bucket like “raw-data-points” above
  • Below this top bucket organise the data points into sub folders based on dates

The names of the date subfolders can be based on the minimum aggregation frequency. You basically put the files into interval folders whose names are built from the date parts in reverse order, i.e. smallest unit first:

minute-hour-day-month-year

Examples:

  • 00-13-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 13:00:00 until 13:29:59 inclusive
  • 30-13-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 13:30:00 until 13:59:59 inclusive
  • 00-14-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 14:00:00 until 14:29:59 inclusive

…and so on. Each subfolder can then hold text files with the raw data points. In order to find a particular storage file of a customer you can do some pre-grouping in the Kinesis client application instead of saving every data point one by one in S3: group the raw data points by customer ID and by the date of the data point, and save the raw files accordingly. You can then have the following text files in S3:

  • abc123-2014-11-15-13-32-43.txt
  • abc123-2014-11-15-13-32-44.txt
  • abc123-2014-11-15-13-32-45.txt

…where the names follow this format:

customerId-year-month-day-hour-minute-second

So within each file you’ll have the CSV or tab delimited raw data that occurred in that given second. If you want to go for a minute-based pre-grouping instead, you’ll end up with the following files:

  • abc123-2014-11-15-13-31.txt
  • abc123-2014-11-15-13-32.txt
  • abc123-2014-11-15-13-33.txt

…and so on. This is the same format as above but at the level of minutes instead.
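To make the naming scheme concrete, here’s a small sketch that computes both the 30-minute interval folder name (minute-hour-day-month-year) and the minute-based per-customer file name from a data point’s CustomerId and DateUnixMs values:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class S3KeyBuilder {

    // The minimum aggregation frequency we decided on: 30 minutes.
    private static final long INTERVAL_MS = 30 * 60 * 1000L;

    // Builds e.g. "30-13-15-11-2014/abc123-2014-11-15-13-32.txt".
    public static String buildKey(String customerId, long dateUnixMs) {
        // Floor the timestamp to the start of its 30-minute interval.
        long intervalStart = (dateUnixMs / INTERVAL_MS) * INTERVAL_MS;

        // Note: SimpleDateFormat isn't thread-safe; create per call or use a ThreadLocal in a real consumer.
        SimpleDateFormat folderFormat = new SimpleDateFormat("mm-HH-dd-MM-yyyy");
        folderFormat.setTimeZone(TimeZone.getTimeZone("UTC"));

        SimpleDateFormat fileFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm");
        fileFormat.setTimeZone(TimeZone.getTimeZone("UTC"));

        String folder = folderFormat.format(new Date(intervalStart));
        String file = customerId + "-" + fileFormat.format(new Date(dateUnixMs)) + ".txt";
        return folder + "/" + file;
    }
}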

Keep in mind that all of the above can be customised based on your data structure. The main point is that S3 is an ideal way to store large amounts of raw data points within the Amazon infrastructure and that you’ll need to carefully think through how to organise your raw data point files so that they are easily handled by an aggregation mechanism.
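As an illustration, here’s how the pre-grouped CSV content could be uploaded under such a key with the AWS SDK for Java; the “raw-data-points” bucket name is the example used above, and credentials and region are assumed to come from the SDK’s default provider chain:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class RawDataUploader {

    // Region and credentials come from the SDK's default provider chain.
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Uploads one pre-grouped batch of CSV lines as a single object under the given key,
    // e.g. "30-13-15-11-2014/abc123-2014-11-15-13-32.txt" in the "raw-data-points" bucket.
    public void upload(String bucket, String key, String csvContent) {
        byte[] bytes = csvContent.getBytes(StandardCharsets.UTF_8);
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(bytes.length);
        s3.putObject(new PutObjectRequest(bucket, key, new ByteArrayInputStream(bytes), metadata));
    }
}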

Finally let’s extend our diagram:

Step 3 Kinesis with KCL and S3

We’ll look at a potential aggregation mechanism in the next post: Elastic MapReduce.

View all posts related to Amazon Web Services here.

Architecture of a Big Data messaging and aggregation system using Amazon Web Services part 1

Introduction

Amazon has a very wide range of cloud-based services. At the time of writing this post Amazon offered the following services:

Amazon web services offerings

It feels like Amazon is building a solution to any software architecture problem one can think of. I have been exposed to a limited number of them professionally and most of them will feature in this series: S3, DynamoDb, Data Pipeline, Elastic Beanstalk, Elastic MapReduce and RedShift. Note that we won’t see any detailed code examples as that could fill an entire book. Here we only talk about an overall architecture where we build up the data flow step by step. I’m planning to post about how to use these components programmatically from .NET and Java during 2015.

Here’s a short description of the system we’d like to build:

  • A large amount of incoming messages needs to be analysed – by “large amount” we mean thousands per second
  • Each message contains a data point with some observation important to your business: sales data, response times, price changes, etc.
  • Every message needs to be processed and stored so that it can be analysed – the individual data points may need to be retrieved later to verify the quality of the analysis
  • The stored messages need to be analysed automatically and periodically – or at random with manual intervention
  • The result of the analysis must be stored in a database where you can view it and possibly draw conclusions

The above scenario is quite generic in Big Data analysis, where a large amount of possibly unstructured messages needs to be analysed. The results of the analysis must be meaningful for your business in some way: maybe they will help you in your decision-making process, or they can provide useful aggregated data that you can sell to your customers.

There are a number of different product offerings and possible solutions to manage such a system nowadays. We’ll go through one of them in this series.

The raw data

What kind of raw data are we talking about? Any type of textual data you can think of. It’s of course an advantage if you can give some structure to the raw data in the form of JSON, XML, CSV or other delimited data.

On the one hand you can have well-formatted JSON data that hits the entry point of your system:

{
    "CustomerId": "abc123",
    "DateUnixMs": 1416603010000,
    "Activity": "buy",
    "DurationMs": 43253
}

Alternatively the same data can arrive in other forms, such as CSV:

abc123,1416603010000,buy,43253

…or as some arbitrary textual input:

Customer id: abc123
Unix date (ms): 1416603010000
Activity: buy
Duration (ms): 43253

It is perfectly feasible that the raw data messages won’t all follow the same input format. Message 1 may be JSON, message 2 may be XML, message 3 may be formatted like this last example above.
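As a very rough sketch, the entry point could normalise such loosely formatted key-value messages into field name/value pairs before any further processing; the colon-separated labels are the ones from the last example above, and real input would of course need more robust handling:

import java.util.HashMap;
import java.util.Map;

// Crude sketch: map a key-value formatted message (as in the last example above)
// onto field name/value pairs. JSON and CSV variants would get their own parsers.
public class KeyValueMessageParser {

    public static Map<String, String> parse(String rawMessage) {
        Map<String, String> fields = new HashMap<>();
        for (String line : rawMessage.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return fields;
    }
}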

The message handler: Kinesis

Amazon Kinesis is a highly scalable message handler that can easily “swallow” large amounts of raw messages. The home page contains a lot of marketing stuff but there’s a load of documentation available for developers, starting here.

In a nutshell:

  • A Kinesis “channel” is called a stream. A stream has a name; clients send their messages to it and consumers of the stream read from it
  • Each stream is divided into shards. The number of shards you choose when you set up the stream in the AWS console determines its read and write throughput
  • A single message cannot exceed 50 KB
  • A message is stored in the stream for 24 hours before it’s deleted

You can read more about the limits, such as max number of shards and max throughput here. Kinesis is relatively cheap and it’s an ideal out-of-the-box entry point for big data analysis.
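For reference, here’s a minimal producer sketch that puts one raw message into a stream using the AWS SDK for Java; the stream name is made up for the example and credentials and region configuration are left to the default provider chain:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class RawMessageProducer {

    private final AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

    // Sends one raw message to the stream; the partition key decides which shard it lands on.
    public void send(String customerId, String rawJsonMessage) {
        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("raw-activity-stream") // hypothetical stream name
                .withPartitionKey(customerId)
                .withData(ByteBuffer.wrap(rawJsonMessage.getBytes(StandardCharsets.UTF_8)));
        kinesis.putRecord(request);
    }
}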

So let’s place the first item on our architecture diagram:

Step 1 Kinesis

The Kinesis consumer

Kinesis stores the messages for only 24 hours and you cannot instruct it to do anything specific with the message. There are no options like “store the message in a database” or “calculate the average value” for a Kinesis stream. The message is loaded into the stream and waits until it is either auto-deleted or pulled by a consumer.

What do we mean by a Kinesis consumer? It is a listener application that pulls the messages out of the stream in batches. The listener is a worker process that continuously monitors the stream. Amazon has SDKs in a large number of languages listed here: Java, Python, .NET etc.

It’s basically any application that processes messages from a given Kinesis stream. If you’re a Java developer then there’s an out-of-the-box starting point on GitHub available here. Here’s the official Amazon documentation of the Kinesis Client Library.
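To give a feel for what such a consumer looks like, here’s a minimal record processor sketch based on the KCL 1.x IRecordProcessor interface; the actual handling of the records (our raw data store in the next post) is only hinted at in a comment:

import java.nio.charset.StandardCharsets;
import java.util.List;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class RawDataRecordProcessor implements IRecordProcessor {

    @Override
    public void initialize(String shardId) {
        // Called once when the processor starts working on a shard.
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            String rawMessage = StandardCharsets.UTF_8.decode(record.getData()).toString();
            // Validate, group and hand the message over to the raw data store (see the next post).
        }
        try {
            // Mark the batch as processed so it isn't delivered again after a restart.
            checkpointer.checkpoint();
        } catch (Exception e) {
            // Log and continue; a later successful checkpoint will cover these records.
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        // Called when the worker loses the shard lease or the application shuts down.
    }
}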

On a general note: most documentation and resources regarding Amazon Web Services are available in Java. If you have the choice, I would recommend Java as option #1 for your development purposes.

Let’s extend our diagram. I picked Java as the development platform, but in practice it could be any of the available SDKs:

Step 2 Kinesis and KCL

How can this application be deployed? It depends on the technology you’ve selected, but for Java, AWS Elastic Beanstalk is a straightforward option. This post gives you details on how to do it.

We’ll continue building our system with a potential raw message store mechanism in the next post: Amazon S3.

View all posts related to Amazon Web Services here.
