← Replacing substrings using Regex in C# .NET: date format example

Replacing substrings using Regex in C# .NET: string cleaning example →

Big Data: using Amazon Kinesis with the AWS.NET API Part 1: introduction

December 11, 2014 Leave a comment

Introduction

Big Data is definitely an important buzzword nowadays. Organisations have to process large amounts of information real time in form of messages in order to make decisions about the future of the company. Companies can also use these messages as data points of something they monitor constantly: sales, response times, stock prices, etc. Their goal is presumably to process the data and extract information from it that their customers find useful and are willing to pay for.

Whatever the purpose there must be a system that is able to handle the influx of messages. You don’t want to lose a single message or let a faulty one stop the chain. You’ll want to have the message queue up and running all the time and make it flexible and scalable so that it can scale up and down depending on the current load. Also, ideally you’ll want to start with the “real” work as soon as possible and not spend too much time on infrastructure management: new servers, load balancers, installing message queue systems etc. Depending on your preferences, it may be better to invest in a ready-made service at least for the initial life of your application. If you then decide that the product is not worth the effort then you can simply terminate the service and then probably haven’t lost as much money as if you had to manage the infrastructure yourself from the beginning.

This is the first installment of a series dedicated to out-of-the-box components built and powered by Amazon Web Services (AWS) enabling Big Data handling. In fact it will be a series of series as I’ll divide the different parts of the chain into their own “compartments”:

Message queue
Message persistence
Analysis
Storing the extracted data

Almost all code will be C# with the exception of SQL-like languages in the “Analysis” section. You’ll need to have an account in Amazon Web Services if you want to try the code examples yourself. Amazon has a free-tier of some of their services which is usually enough for testing purposes before your product turns into something serious. Even if there’s no free tier available, like in the case of Kinesis, the costs you incur with minimal tests are far from prohibiting. Amazon is bringing down its prices on AWS components quite often as their volumes grow larger. By signing up with Amazon and creating a user you’ll also get a pair of security keys: an Amazon Access Key and a Secret Access Key.

Note that we’ll be concentrating on showing how to work with the .NET AWS SDK. We won’t organise our code according to guidelines like SOLID and layered architecture – it’s your responsibility to split your code into manageable bits and pieces.

Here we’re starting with the entry point of the system, i.e. the message handler.

Amazon Kinesis

Amazon Kinesis is a highly scalable cloud-based messaging system which can handle extremely large amounts of messages. There’s another series dedicated to a high-level view of a possible Big Data handling architecture. It takes up the same topics as this series but without going down to the code level. If you’re interested in getting the larger picture I really encourage you to check it out. The first post of that series takes up Kinesis so I’ll copy the relevant sections here.

The raw data

What kind of raw data are we talking about? Any type of textual data you can think of. It’s of course an advantage if you can give some structure to the raw data in the form of JSON, XML, CSV or other delimited data.

On the one hand you can have well-formatted JSON data that hits the entry point of your system:

{
    "CustomerId": "abc123",
    "DateUnixMs": 1416603010000,
    "Activity": "buy",
    "DurationMs": 43253
}

Alternatively the same data can arrive in other forms, such as CSV:

abc123,1416603010000,buy,43253

…or as some arbitrary textual input:

Customer id: abc123
Unix date (ms): 1416603010000
Activity: buy
Duration (ms): 43253

It is perfectly feasible that the raw data messages won’t all follow the same input format. Message 1 may be JSON, message 2 may be XML, message 3 may be formatted like this last example above.

The message handler: Kinesis

Amazon Kinesis is a highly scalable message handler that can easily “swallow” large amounts of raw messages. The home page contains a lot of marketing stuff but there’s a load of documentation available for developers, starting here. Most of it is in Java though.

In a nutshell:

A Kinesis “channel” is called a stream. A stream has a name that clients can send their messages to and that consumers of the stream can read from
Each stream is divided into shards. You can specify the read and write throughput of your Kinesis stream when you set it up in the AWS console
A single message can not exceed 50 KB
A message is stored in the stream for 24 hours before it’s deleted

You can read more about the limits, such as max number of shards and max throughput here. Kinesis is relatively cheap and it’s an ideal out-of-the-box entry point for big data analysis.

Kinesis will take a lot of responsibility from your shoulders: scaling, stream and shard management, infrastructure management etc. It’s possible to create a new stream in 5 minutes and you’ll be able to post – actually PUT – messages to that stream immediately after it was created. On the other hand the level of configuration is quite limited which may be both good and bad, it depends on your goals. Examples:

There’s no way to add any logic to the stream in the GUI
You cannot easily limit the messages to the stream, e.g. by defining a message schema so that malformed messages are discarded automatically
You cannot define what should happen to the messages in the stream, e.g. in case you want to do some pre-aggregation

However, I don’t think these are real limitations as other message queue solutions will probably be similar.

In the next post we’ll create a Kinesis stream, install the .NET AWS SDK and define our thin domain.

View all posts related to Amazon Web Services and Big Data here.

Filed under .NET, Amazon, Big Data Tagged with amazon, amazon cloud, aws, big data, c#, kinesis

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

Exercises in .NET with Andras Nemes

Big Data: using Amazon Kinesis with the AWS.NET API Part 1: introduction

Leave a comment Cancel reply

My profile

Andras Nemes

Verified Services

Follow my blog via email

Top Posts & Pages

History

My tweets

Blogs I Follow

Exercises in .NET with Andras Nemes

Big Data: using Amazon Kinesis with the AWS.NET API Part 1: introduction

Share:

Related

Leave a comment Cancel reply

My profile

Andras Nemes

Verified Services

Follow my blog via email

Top Posts & Pages

History

My tweets

Keywords

Blogs I Follow