Using Amazon S3 with the AWS.NET API Part 5: S3 in Big Data
January 15, 2015
Introduction
In the previous post we looked at how to work with Amazon S3 folders in code.
This post will take up the Big Data thread where we left off at the end of the previous series on the scalable and powerful message handling service Amazon Kinesis. Therefore the prerequisite for following the code examples in this post is familiarity with what we discussed there. However, I’ll try to write in a way that those of you who’ve only come here for S3 may get a taste of the role it can play in a cloud-based Big Data architecture. Who knows, you might just learn something useful.
I’ve split the discussion into 2 parts. In this post we’ll decide on our data storage strategy without writing any code. We’ll implement the strategy in the next post.
Reminder
Where were we in the first place? The last post of the Amazon Kinesis series stopped where we stored our raw data in a text file on our local drive like this:
yahoo http://www.yahoo.com GET 432 1417556120657
google http://www.google.com POST 532 1417556133322
bbc http://www.bbc.co.uk GET 543 1417556148276
twitter http://www.twitter.com GET 623 1417556264008
wiki http://www.wikipedia.org POST 864 1417556302529
facebook http://www.facebook.com DELETE 820 1417556319381
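Each row corresponds to a WebTransaction object collected by the Kinesis client application. The exact class was shown in the Kinesis series; purely as a reminder, a minimal sketch matching the columns above could look like the following, where the property names are my assumptions rather than a copy of the original:

public class WebTransaction
{
	public string CustomerName { get; set; }            //e.g. "yahoo"
	public string Url { get; set; }                     //e.g. "http://www.yahoo.com"
	public string HttpMethod { get; set; }              //e.g. "GET"
	public int ResponseTimeMs { get; set; }             //e.g. 432
	public long ObservationDateUtcUnixMs { get; set; }  //e.g. 1417556120657
}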
We hid the relevant function behind an interface:
public interface IRawDataStorage
{
	void Save(IEnumerable<WebTransaction> webTransactions);
}
Currently we have one implementation of the interface: FileBasedDataStorage. We’ll now use our new S3 skills and create a new implementation.
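To give you a feel for where we’re heading, here’s a rough skeleton of such an S3-based implementation. The class name, the injected bucket name and the TODO body are placeholders only; we’ll work out the actual storage strategy below and fill in the details in the next post:

using System;
using System.Collections.Generic;
using Amazon.S3;

public class AmazonS3RawDataStorage : IRawDataStorage
{
	private readonly IAmazonS3 _s3Client;
	private readonly string _topBucketName;

	public AmazonS3RawDataStorage(IAmazonS3 s3Client, string topBucketName)
	{
		_s3Client = s3Client;
		_topBucketName = topBucketName;
	}

	public void Save(IEnumerable<WebTransaction> webTransactions)
	{
		//TODO: group the data points, build the folder/file key according to the
		//strategy described below and upload the tab-delimited contents to S3
		throw new NotImplementedException();
	}
}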
Strategy
We’ll give some structure to our S3 URL transaction data store. If we dump all observations into the same bucket then it will be difficult for both humans and software to search and aggregate the data.
Let’s recap what we discussed regarding the raw data and why we’re saving it in a tab-delimited format:
The format will most likely depend on the mechanism that will eventually pull data from the raw data store. Data mining and analysis solutions such as Amazon RedShift or Elastic MapReduce (EMR) – which we’ll take up later on – will all need to work with the raw data. So at this stage you’ll need to do some forward thinking:
- A: What mechanism will need to read from the raw data store for aggregation?
- B: How can we easily – or relatively easily – read the raw data visually by just opening a raw data file?
B is important for debugging purposes if you want to verify the calculations. It’s also important if some customer is interested in viewing the raw data for some time period. For B you might want to store the raw data as it is, i.e. as JSON. E.g. you can have a text file with the following data points:
{"CustomerId": "abc123", "DateUnixMs": 1416603010000, "Activity": "buy", "DurationMs": 43253} {"CustomerId": "abc123", "DateUnixMs": 1416603020000, "Activity": "buy", "DurationMs": 53253} {"CustomerId": "abc123", "DateUnixMs": 1416603030000, "Activity": "buy", "DurationMs": 63253} {"CustomerId": "abc123", "DateUnixMs": 1416603040000, "Activity": "buy", "DurationMs": 73253}
…i.e. with one data point per line.
However, this format is not suitable for point A above. Other mechanisms will have a hard time understanding this data format. For RedShift and EMR to work most efficiently we’ll need to store the raw data in some delimited fields such as CSV or tab delimited fields. So the above data points will then be stored as follows in a tab-delimited file:
abc123 1416603010000 buy 43253
abc123 1416603020000 buy 53253
abc123 1416603030000 buy 63253
abc123 1416603040000 buy 73253
This is probably OK for point B above as well. It’s not too hard on your eyes to understand this data structure so we’ll settle for that. You might ask why we didn’t select some other delimiter, such as a pipe ‘|’ or a comma ‘,’. The answer is that our demo system is based on URLs and URLs can have pipes and commas in them making them difficult to split. Tabs will work better but you are free to choose whatever fits your system best.
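To make the chosen format concrete, turning a WebTransaction into such a tab-delimited line is essentially a one-liner. A possible sketch, reusing the property names assumed above:

//build one tab-delimited line per web transaction
private string ToTabDelimitedLine(WebTransaction webTransaction)
{
	return string.Join("\t", webTransaction.CustomerName, webTransaction.Url,
		webTransaction.HttpMethod, webTransaction.ResponseTimeMs,
		webTransaction.ObservationDateUtcUnixMs);
}

Joining such lines with a newline character then gives the contents of one raw data file.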
File organisation refinements
Say we’d like to find out the average response time of yahoo.com over January 2014. If we put all raw data points in the same bucket with no folders then it will be difficult and time consuming to find the correct data points for data analysis. I took up this topic in another Amazon Big Data related series on this blog. I’ll copy the relevant considerations here with some modifications.
So the question now is how we actually organise our raw data files into buckets and folders. Again, we’ll need to consider points A and B above. In addition you’ll need to consider the frequency of your data aggregations: once a day, every hour, every quarter of an hour?
You might first go for a customer ID – or some other ID – based grouping, so e.g. you’ll have a top bucket and sub folders for each customer:
- Top bucket: “raw-data-points”
- Subfolder: “customer123”
- Subfolder: “customer456”
- Subfolder: “customer789”
- …etc.
…and within each subfolder you can have subfolders based on dates in the raw data points, e.g.:
- Sub-subfolder: “2014-10-11”
- Sub-subfolder: “2014-10-12”
- Sub-subfolder: “2014-10-13”
- Sub-subfolder: “2014-10-14”
- …etc.
That looks very nice and it probably satisfies question B above but not so much question A. This structure is difficult to handle for an aggregation mechanism as you’ll need to provide complex search criteria for the aggregation. In addition, suppose you want to aggregate the data every 30 minutes and you dump all raw data points into one of those sub-subfolders. Then again you’ll need to set up difficult search criteria for the aggregation mechanism to extract just the correct raw data points.
One possible solution is the following:
- Decide on the minimum aggregation frequency you’d like to support in your system – let’s take 30 minutes for the sake of this discussion
- Have one dedicated top bucket like “raw-data-points” above
- Below this top bucket organise the data points into sub folders based on dates
- There will be only one subfolder per period within the top bucket to make data access and searches easier
- Each subfolder will contain a number of files which hold the raw data points in a tab delimited format
The names of the date sub-folders can be based on the minimum aggregation frequency. You’ll basically put the files into intervals where the date parts are reversed according to the following format:
minute-hour-day-month-year
Examples:
- 00-13-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 13:00:00 until 13:29:59 inclusive
- 30-13-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 13:30:00 until 13:59:59 inclusive
- 00-14-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 14:00:00 until 14:29:59 inclusive
…and so on. Each subfolder can then hold text files with the raw data points. In order to find a particular storage file of a customer you can do some pre-grouping in the Kinesis client application and not just save every data point one by one in S3: group the raw data points according to the customer ID and the date of the data point and save the raw files accordingly. You can then have the following text files in S3:
- abc123-2014-11-15-13-32-43.txt
- abc123-2014-11-15-13-32-44.txt
- abc123-2014-11-15-13-32-45.txt
…where the names follow this format:
customerId-year-month-day-hour-minute-second
So within each file you’ll have the CSV or tab delimited raw data that occurred in that given second. In case you want to go for a minute-based pre-grouping then you’ll end up with the following files:
- abc123-2014-11-15-13-31.txt
- abc123-2014-11-15-13-32.txt
- abc123-2014-11-15-13-33.txt
…and so on. This is the same format as above but at the level of minutes instead.
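Both the interval folder name and the minute-level file name can be derived from the observation timestamp with a bit of date arithmetic. The following is a minimal sketch assuming a 30-minute minimum interval and UTC dates; the helper names are mine:

//name of the interval subfolder, e.g. "30-13-15-11-2014" for 2014 November 15, 13:42 UTC
private string BuildIntervalFolderName(DateTime observationUtc, int intervalMinutes)
{
	int intervalStartMinute = (observationUtc.Minute / intervalMinutes) * intervalMinutes;
	return string.Format("{0:D2}-{1:D2}-{2:D2}-{3:D2}-{4}", intervalStartMinute,
		observationUtc.Hour, observationUtc.Day, observationUtc.Month, observationUtc.Year);
}

//name of the minute-level raw data file, e.g. "yahoo-2014-11-15-13-42.txt"
private string BuildFileName(string customerId, DateTime observationUtc)
{
	return string.Format("{0}-{1:yyyy-MM-dd-HH-mm}.txt", customerId, observationUtc);
}

//the Unix millisecond timestamps in the raw data can be converted to a UTC DateTime like this
private DateTime FromUnixMilliseconds(long unixMs)
{
	return new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc).AddMilliseconds(unixMs);
}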
Conclusion
Based on the above let’s go for the following strategy:
- Minimum aggregation interval: 30 minutes
- One top bucket
- Subfolders will follow the rules outlined above, e.g. 00-13-15-11-2014
- We’ll group the incoming data points by the minute: customerId-year-month-day-hour-minute
Example: if our top bucket in S3 is called “raw-data” then we’ll have the following file hierarchy:
- raw-data bucket
- sub-folder 00-13-15-11-2014
- Within 00-13-15-11-2014 files like yahoo-2014-11-15-13-15.txt and facebook-2014-11-15-13-16.txt
- Another subfolder within ‘raw-data’: 30-13-15-11-2014
- Within 30-13-15-11-2014 files like yahoo-2014-11-15-13-42.txt and facebook-2014-11-15-13-43.txt
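Recall from the previous post that S3 folders are really just key prefixes, so saving a raw data file into this hierarchy comes down to building the right key. Here’s a hedged sketch of the upload step; intervalFolderName, fileName and tabDelimitedContents are placeholder variables, the region is an example and the bucket name is the one from the hierarchy above:

//requires the Amazon, Amazon.S3 and Amazon.S3.Model namespaces
using (IAmazonS3 s3Client = new AmazonS3Client(RegionEndpoint.EUWest1))
{
	PutObjectRequest putRequest = new PutObjectRequest()
	{
		BucketName = "raw-data",
		//the "folder/file" key, e.g. 00-13-15-11-2014/yahoo-2014-11-15-13-15.txt
		Key = string.Concat(intervalFolderName, "/", fileName),
		ContentBody = tabDelimitedContents
	};
	PutObjectResponse putResponse = s3Client.PutObject(putRequest);
}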
Keep in mind that all of the above can be customised based on your data structure. The main point is that S3 is an ideal way to store large amounts of raw data points within the Amazon infrastructure and that you’ll need to carefully think through how to organise your raw data point files so that they are easily handled by an aggregation mechanism.
We’ll implement this strategy in the next post.
View all posts related to Amazon Web Services and Big Data here.