Using Amazon S3 with the AWS.NET API Part 3: code basics cont’d

Introduction

In the previous post we looked at some basic code examples for Amazon S3: list all buckets, create a new bucket and upload a file to a bucket.

In this post we’ll continue with some more code examples: downloading a resource, deleting it and listing the available objects.

We’ll extend the AmazonS3Demo C# console application with reading, listing and deleting objects.

Listing files in a bucket

The following method in S3DemoService.cs will list all objects within a bucket:

public void RunObjectListingDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			ListObjectsRequest listObjectsRequest = new ListObjectsRequest();
			listObjectsRequest.BucketName = "a-second-bucket-test";
			ListObjectsResponse listObjectsResponse = s3Client.ListObjects(listObjectsRequest);
			foreach (S3Object entry in listObjectsResponse.S3Objects)
			{
				Console.WriteLine("Found object with key {0}, size {1}, last modification date {2}", entry.Key, entry.Size, entry.LastModified);
			}
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Object listing has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

We use a ListObjectsRequest object to retrieve all objects from a bucket by providing the bucket name. For each object we print the key name, the object size and the last modification date, simple as that. In the previous post I uploaded a file called logfile.txt to the bucket called “a-second-bucket-test”. Accordingly, calling this method from Main…

static void Main(string[] args)
{
	S3DemoService demoService = new S3DemoService();
	demoService.RunObjectListingDemo();

	Console.WriteLine("Main done...");
	Console.ReadKey();
}

…yields the following output:

Found object with key logfile.txt, size 4490, last modification date 2014-12-06 13:25:45.

The ListObjectsRequest object provides some basic search functionality. The “Prefix” property will limit the search results to those objects whose key names start with that prefix, e.g.:

listObjectsRequest.Prefix = "log";

That will find all objects whose key names start with “log”, i.e. logfile.txt is still listed.

You can list a limited number of elements, say 5:

listObjectsRequest.MaxKeys = 5;

You can also set a marker, meaning that the request will only list the files whose keys come after the marker value alphabetically:

listObjectsRequest.Marker = "leg";

This will find “logfile.txt”. However, a marker value of “lug” won’t.
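
These filters can be combined as well. Here’s a minimal sketch, assuming the same s3Client handle and the logfile.txt object from before:

ListObjectsRequest filteredRequest = new ListObjectsRequest();
filteredRequest.BucketName = "a-second-bucket-test";
filteredRequest.Prefix = "log";   //only keys starting with "log"
filteredRequest.MaxKeys = 5;      //return at most 5 objects
filteredRequest.Marker = "leg";   //only keys that sort after "leg"
ListObjectsResponse filteredResponse = s3Client.ListObjects(filteredRequest);
Console.WriteLine("Found {0} matching object(s).", filteredResponse.S3Objects.Count);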

Download a file

Downloading a file from S3 involves reading from a Stream, a standard operation in the world of I/O. The following function will load the stream from logfile.txt, print its metadata and convert the downloaded byte array into a string:

public void RunDownloadFileDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			GetObjectRequest getObjectRequest = new GetObjectRequest();
			getObjectRequest.BucketName = "a-second-bucket-test";
			getObjectRequest.Key = "logfile.txt";
			GetObjectResponse getObjectResponse = s3Client.GetObject(getObjectRequest);
			MetadataCollection metadataCollection = getObjectResponse.Metadata;

			ICollection<string> keys = metadataCollection.Keys;
			foreach (string key in keys)
			{
				Console.WriteLine("Metadata key: {0}, value: {1}", key, metadataCollection[key]);
			}

			using (Stream stream = getObjectResponse.ResponseStream)
			{
				long length = stream.Length;
				byte[] bytes = new byte[length];
				int bytesToRead = (int)length;
				int numBytesRead = 0;
				do
				{
					int chunkSize = 1000;
					if (chunkSize > bytesToRead)
					{
						chunkSize = bytesToRead;
					}
					int n = stream.Read(bytes, numBytesRead, chunkSize);
					numBytesRead += n;
					bytesToRead -= n;
				}
				while (bytesToRead > 0);
				String contents = Encoding.UTF8.GetString(bytes);
				Console.WriteLine(contents);
			}
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Object download has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}
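
If you’d rather not manage the read loop yourself, copying the response stream into a MemoryStream achieves the same result. A minimal sketch, assuming the same getObjectResponse as above:

using (Stream responseStream = getObjectResponse.ResponseStream)
using (MemoryStream memoryStream = new MemoryStream())
{
	//CopyTo is available from .NET 4 onwards
	responseStream.CopyTo(memoryStream);
	string fileContents = Encoding.UTF8.GetString(memoryStream.ToArray());
	Console.WriteLine(fileContents);
}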

Run it from Main:

demoService.RunDownloadFileDemo();

In my case there was only one metadata entry:

Metadata key: x-amz-meta-type, value: log

…which is the one I attached to the file in the previous post.

In the above case we know beforehand that we’re reading text so the bytes can be converted into a string. However, this is of course not necessarily the case as you can store any file type on S3. The GetObjectResponse object has a method which allows you to save the stream directly into a file:

getObjectResponse.WriteResponseStreamToFile("full file path");

…which has an overload to append the stream contents to an existing file:

getObjectResponse.WriteResponseStreamToFile("full file path", true);

Deleting a file

Deleting an object from S3 is just as straightforward as uploading it:

public void RunFileDeletionDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			DeleteObjectRequest deleteObjectRequest = new DeleteObjectRequest();
			deleteObjectRequest.BucketName = "a-second-bucket-test";
			deleteObjectRequest.Key = "logfile.txt";
			DeleteObjectResponse deleteObjectResponse = s3Client.DeleteObject(deleteObjectRequest);
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Object deletion has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Calling this from Main…

demoService.RunFileDeletionDemo();

…removes the previously uploaded logfile.txt from the bucket:

File removed from Amazon S3 bucket
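
If you ever need to remove several objects in one round trip, the SDK also offers a batch delete through DeleteObjectsRequest. A minimal sketch, assuming the keys below exist in the bucket (the second key is made up):

DeleteObjectsRequest deleteObjectsRequest = new DeleteObjectsRequest();
deleteObjectsRequest.BucketName = "a-second-bucket-test";
deleteObjectsRequest.AddKey("logfile.txt");
deleteObjectsRequest.AddKey("another-logfile.txt"); //hypothetical second key
DeleteObjectsResponse deleteObjectsResponse = s3Client.DeleteObjects(deleteObjectsRequest);
Console.WriteLine("Deleted {0} object(s).", deleteObjectsResponse.DeletedObjects.Count);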

In the next post we’ll see how to work with folders in code.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon S3 with the AWS.NET API Part 2: code basics

Introduction

In the previous post we looked at an overview of Amazon S3 and we also tried a couple of simple operations on the GUI. In this post we’ll start coding. If you followed along the previous series on Amazon Kinesis then the setup section will be familiar to you. Otherwise I’ll assume that you are starting this series without having read anything else on this blog.

Note that we’ll be concentrating on showing and explaining the technical code examples related to AWS. We’ll ignore software principles like SOLID and layering so that we can stay focused. It’s your responsibility to organise your code properly. There are numerous posts on this blog that take up topics related to software architecture.

Installing the SDK

The Amazon .NET SDK is available through NuGet. Open Visual Studio 2012/2013 and create a new C# console application called AmazonS3Demo. The purpose of this application will be to demonstrate the different parts of the SDK around S3. In reality the S3 handler could be any type of application:

  • A website
  • A Windows/Android/iOS app
  • A Windows service
  • etc.

…i.e. any application that’s capable of sending HTTP/S requests to a web service endpoint. We’ll keep it simple and not waste time with view-related tasks.

Install the following NuGet package:

AWS SDK NuGet package

Preparations

We cannot just call the services within the AWS SDK without proper authentication. This is an important reference page to handle your credentials in a safe way. We’ll take the recommended approach and create a profile in the SDK Store and reference it from app.config.

This series is not about AWS authentication so we won’t go into temporary credentials, but later on you may be interested in that option too. Since we’re programmers and it takes a single line of code to set up a profile, we’ll go with the programmatic option. Add the following line to Main:

Amazon.Util.ProfileManager.RegisterProfile("demo-aws-profile", "your access key id", "your secret access key");

I suggest you remove the code from the application later on in case you want to distribute it. Run the application and it should execute without exceptions. Next open app.config and add the appSettings section with the following elements:

<appSettings>
        <add key="AWSProfileName" value="demo-aws-profile"/>
</appSettings>

First demo: listing the available buckets

We’ll put all our test code into a separate class. Insert a class file called S3DemoService.cs. We’ll need a method that builds a handle to the service, which is of type IAmazonS3:

private IAmazonS3 GetAmazonS3Client()
{
	return Amazon.AWSClientFactory.CreateAmazonS3Client(RegionEndpoint.EUWest1);
}

Note that we didn’t need to provide our credentials here. They will be extracted automatically using the profile name in the config file.

Let’s first find out what buckets we have in S3. This is almost a trivial task:

public void RunBucketListingDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			ListBucketsResponse response = s3Client.ListBuckets();
			List<S3Bucket> buckets = response.Buckets;
			foreach (S3Bucket bucket in buckets)
			{
				Console.WriteLine("Found bucket name {0} created at {1}", bucket.BucketName, bucket.CreationDate);
			}
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Bucket listing has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode );
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

We use the client to list all the available buckets. The method returns a ListBucketsResponse object which will hold the buckets.

You’ll see a lot of these Request and Response objects throughout the AWS SDK. Amazon are fond of wrapping the request parameters and response properties into Request and Response objects adhering to the Request-Response pattern.

We then list the name and creation date of our buckets. Call this method from Main:

static void Main(string[] args)
{
	S3DemoService demoService = new S3DemoService();
	demoService.RunBucketListingDemo();

	Console.WriteLine("Main done...");
	Console.ReadKey();
}

Example output:

Found bucket name a-first-bucket-test created at 2014-12-03 22:30:59

Second demo: creating a new bucket

Creating a new bucket is equally easy. Add the following method to S3DemoService:

public void RunBucketCreationDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			PutBucketRequest putBucketRequest = new PutBucketRequest();
			String newBucketName = "a-second-bucket-test";
			putBucketRequest.BucketName = newBucketName;
			PutBucketResponse putBucketResponse = s3Client.PutBucket(putBucketRequest);					
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Bucket creation has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Again, we can see a Request and a corresponding Response object for creating a bucket. We only specify a name, which means that we go with the default values for things like permissions; they are usually fine. PutBucketRequest provides properties to override the defaults. E.g. here’s how to give Everyone the permission to view the bucket:

S3Grant grant = new S3Grant();
S3Permission permission = new S3Permission("List");
S3Grantee grantee = new S3Grantee();
grantee.CanonicalUser = "Everyone";
grant.Grantee = grantee;
grant.Permission = permission;
List<S3Grant> grants = new List<S3Grant>() { grant };
putBucketRequest.Grants = grants;

Call RunBucketCreationDemo from Main as follows:

demoService.RunBucketCreationDemo();

Run the application and the bucket should be created:

Second bucket created from code Amazon S3

It’s not allowed to have two buckets with the same name. Run the application again and you should get an exception:

Bucket creation has failed.
Amazon error code: BucketAlreadyOwnedByYou
Exception message: Your previous request to create the named bucket succeeded and you already own it.
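
One way to avoid that exception is to check whether the bucket already exists before calling PutBucket. Here’s a minimal sketch that reuses the ListBuckets call from the first demo (it requires a using System.Linq directive):

ListBucketsResponse bucketsResponse = s3Client.ListBuckets();
bool bucketExists = bucketsResponse.Buckets.Any(b => b.BucketName == "a-second-bucket-test");
if (!bucketExists)
{
	PutBucketRequest putBucketRequest = new PutBucketRequest();
	putBucketRequest.BucketName = "a-second-bucket-test";
	s3Client.PutBucket(putBucketRequest);
}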

Third demo: file creation

Let’s upload a text file to the bucket we’ve just created. To be exact, we’ll upload its contents. Create some text file on your hard drive such as c:\logfile.txt and add some text to it. This example shows how to set the contents of the request. We also set an arbitrary metadata key-value pair of key “type” and value “log”. You can set any type of metadata you want:

public void RunFileUploadDemo()
{
	FileInfo filename = new FileInfo( @"c:\logfile.txt");
	string contents = File.ReadAllText(filename.FullName);
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			PutObjectRequest putObjectRequest = new PutObjectRequest();
			putObjectRequest.ContentBody = contents;
			putObjectRequest.BucketName = "a-second-bucket-test";
			putObjectRequest.Metadata.Add("type", "log");
			putObjectRequest.Key = filename.Name;
			PutObjectResponse putObjectResponse = s3Client.PutObject(putObjectRequest);
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("File creation has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Call this method from Main:

demoService.RunFileUploadDemo();

Run the application and the file with its custom metadata should be visible in S3:

File uploaded to Amazon S3 with metadata
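
If you’d rather hand the SDK the file itself instead of reading its contents into a string first, PutObjectRequest also accepts a file path. A minimal sketch under the same bucket and key assumptions:

PutObjectRequest putFileRequest = new PutObjectRequest();
putFileRequest.BucketName = "a-second-bucket-test";
putFileRequest.Key = "logfile.txt";
putFileRequest.FilePath = @"c:\logfile.txt"; //the SDK reads and streams the file contents for us
PutObjectResponse putFileResponse = s3Client.PutObject(putFileRequest);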

We’ll continue with more examples in the next post.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon S3 with the AWS.NET API Part 1: introduction

Introduction

Cloud-based blob storage solutions abound and Amazon Web Services (AWS) is the leader – or one of the leaders – in that area. Amazon S3 (Simple Storage Service) provides a “secure, durable, highly-scalable object storage” solution, as it is stated on the homepage. You can use S3 to store just about any type of file: images, text files, videos, JAR files, HTML pages etc. All files are stored in a key-value map, i.e. each file has a key where the file itself is the value.

Purpose of S3

S3 is often used to store static components of web pages such as images or videos. S3 can be integrated with other Amazon components such as RedShift and Elastic MapReduce. It can also be used to transfer large amounts of data from one component to another.

However, in this series we’ll be concentrating on a slightly different but very specific usage: saving, deleting and checking for the existence of text based storage files. S3 can function as an important building block in a Big Data analysis system where a data mining application can pull the raw data from an S3 bucket, i.e. a container of files.

In S3 you can organise your files into pseudo-folders. I wrote “pseudo” as they are not real folders like the ones we create on Windows. They are rather used as visual containers so that you can organise your files in a meaningful way instead of putting all of them under the same bucket. Examples:

s3://sales/january-2015/monthly-sales.txt
s3://sales/february-2015/monthly-sales.txt

“sales” is the top bucket, “january-2015” is a folder and then we have the file itself. You are free to create subfolders within each folder and subfolders within the subfolders etc.

Keep in mind that S3 is not used for updates though. Once you’ve uploaded a file to S3 then it cannot be updated in a one-step operation. Even if you want to edit a text file there’s no editor for it. You’ll need to delete the old file and upload a new one instead.

Amazon have also done a great job at providing SDKs for a range of platforms, like .NET, Java, Python etc. Programmatic access to S3’s services is equally available.

As with many other Amazon components, don’t assume that S3 can only be used with other Amazon components. Any software capable of executing HTTP calls can access S3: Web API, .NET MVC, iOS, Java desktop apps, Windows services, you name it. Mixed architecture is quickly becoming the norm nowadays and S3 can be used either as part of a mixed solution or in an Amazon-only architecture.

Goals of this series

The goals of this series are two-fold:

  • Provide basic UI and programmatic knowledge to anyone looking for a fast, cheap, reliable and scalable cloud-based blob storage solution
  • Tie in with the previous series about another Amazon service called Kinesis, which provides a reliable message queue service as the entry point to a Big Data data mining system. We’ll take up S3 as an alternative for raw data storage towards the end of the series. The series on Kinesis ended where we stored the raw data in a local text file – we’ll replace it with S3-based storage

I’ll try to keep the two goals as separate as possible so that all readers with different motivations can follow through – those of you who are only interested in S3 as well as those that will see the greater picture.

I’ll assume that you have at least a test account of Amazon Web Services including the necessary access keys: an Amazon Access Key and a Secret Access Key. You’ll need to sign up with Amazon and then sign up for S3 within Amazon. You can create a free account on the S3 home page using the Create Free Account link button:

S3 create free account link button

Amazon has a lot of great documentation online. Should you get stuck you’ll almost always find an answer there. Don’t be afraid to ask in the comments section below.

In this post we’ll take an easy start and go through some of the visual aspects of S3.

S3 GUI

Log onto AWS and locate the S3 service link:

S3 link on Amazon UI

Before we go anywhere it’s important to mention regions in Amazon. A region in Amazon Web Services is the geographical location of the data centre where your selected service will be located. E.g. if you create a cloud-based server with Amazon EC2 in the region called US East then the server will be set up in North Virginia. It doesn’t mean that your home page deployed on that server won’t be reachable anywhere else; it only determines the physical location of the service. So if you’re expecting the bulk of your customers to come from Japan then it’s wise to set up the first web server in the region called Asia Pacific (Tokyo). Also, if you set up a service in e.g. EU (Ireland) on the AWS UI, then log out and log in again, your service may not be visible at first. A good guess is that you need to select the correct region. The region is indicated in the URL, e.g.:

https://console.aws.amazon.com/s3/home?region=eu-west-1#

…where “eu-west-1” stands for EU (Ireland). You can normally select the region in the top right hand corner of the Amazon UI where you’ll see the user-friendly names of the regions.

Regions are important for virtually all the components in AWS. Take a wild guess which component is an exception, then check out the top right hand corner of the S3 screen:

No regions for Amazon S3

So regions don’t play the same role here as in other components in the AWS product offering. However, the selected region can still be used to “optimize for latency, minimize costs, or address regulatory requirements” as it says in the Create a Bucket window we’ll soon see.

You’ll see that by default the screen will show all the top buckets:

S3 All buckets screen

Let’s try a couple of things to get our hands dirty. Click on the Create Bucket button. Give the bucket a name, select the nearest region to your location and press Create. We’ll skip logging right now:

Creating first bucket Amazon S3

The bucket is quickly created:

Created first bucket in Amazon S3

On the right hand side of the screen you can set various properties of the bucket such as logging and security:

Set properties of bucket in Amazon S3

We won’t go into them at all, otherwise this series will lose its focus. The default values are usually fine for most purposes. If you ever need to modify the settings, especially those that have to do with permissions, then consult the AWS documentation of S3.

Click on the name of the bucket and you’ll see that it’s empty:

First bucket is empty in Amazon S3

Click the Upload button and click Add Files:

Upload a file in Amazon S3

Select some file on your hard drive, preferably a text file as they are easier to open. Click “Start Upload” and the file upload progress should appear on the right hand side:

Text file uploaded to bucket in Amazon S3

Click the file name to select it. The Actions drop-down will list several options available for a file such as Open, Download or Delete which are self-explanatory.

Now click the “Create Folder” button and give the folder a name:

Created folder in Amazon S3 bucket

You can click the folder name to open its contents – much like double-clicking a folder on the Windows file system. You can then upload different files to that folder and create subfolders.

This is enough for starters. We’ll start looking into some basic operations in code in the next post.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon Kinesis with the AWS.NET API Part 6: storage

Introduction

In the previous post we added some validation to our demo message handling application. Validation adds some sanity checks to our logic so that bogus inputs are discarded.

In this post, which will be the last in the series on Amazon Kinesis, we’ll be looking at storage. We’ll save the data on disk, which in itself is not too interesting, but we’ll also discuss some formats that are suitable for further processing.

Formats

It is seldom that we’re saving data just to fill up a data store. This is true in our case as well. We’re getting the messages from the Kinesis stream and we’ll soon be saving them. However, we’ll certainly want to perform some actions on the data, such as data aggregations:

  • Calculate the average response time for http://www.bbc.co.uk/africa between 12:15 to 12:30 on 13 January 2015 for users with Firefox 11
  • Calculate max response time in week 45 2014 for the domain cnn.com for users located in Seattle
  • Calculate the 99th percentile of the response time for http://www.twitter.com for February 2014

…etc. Regardless of where you’re planning to save the data, such as a traditional relational DB like MS SQL or a NoSql DB such as MongoDb, you’ll need to plan the storage format, i.e. what tables, collections, columns and datatypes you’ll need. As the next Amazon component we’ll take up on this blog is the blob storage S3, we’ll be concentrating on storing the raw data points in a text file. At first this may seem like a very bad idea but S3 is a very efficient, durable and scalable storage service. However, don’t assume that this is a must for your Big Data system to work; you can save your data the way you want. Here we’re just paving the way for the next step.

As mentioned before in this series I have another, higher-level set of posts dedicated to Amazon architecture available here. I took up a similar topic there about message formats, I’ll re-use some of those explanations below.

The format will most likely depend on the mechanism that will eventually pull data from the raw data store. Data mining and analysis solutions such as Amazon RedShift or Elastic MapReduce (EMR) – which we’ll take up later on – will all need to work with the raw data. So at this stage you’ll need to do some forward thinking:

  • A: What mechanism will need to read from the raw data store for aggregation?
  • B: How can we easily – or relatively easily – read the raw data visually by just opening a raw data file?

B is important for debugging purposes if you want to verify the calculations. It’s also important if some customer is interested in viewing the raw data for some time period. For B you might want to store the raw data as it is, i.e. as JSON. E.g. you can have a text file with the following data points:

{"CustomerId": "abc123", "DateUnixMs": 1416603010000, "Activity": "buy", "DurationMs": 43253}
{"CustomerId": "abc123", "DateUnixMs": 1416603020000, "Activity": "buy", "DurationMs": 53253}
{"CustomerId": "abc123", "DateUnixMs": 1416603030000, "Activity": "buy", "DurationMs": 63253}
{"CustomerId": "abc123", "DateUnixMs": 1416603040000, "Activity": "buy", "DurationMs": 73253}

…i.e. with one data point per line.

However, this format is not really suitable for point A above. Other mechanisms will have a hard time understanding this data format. For RedShift and EMR to work most efficiently we’ll need to store the raw data in some delimited format such as comma- or tab-separated fields. The above data points will then be stored as follows in a tab-delimited file:

abc123     1416603010000    buy    43253
abc123     1416603020000    buy    53253
abc123     1416603030000    buy    63253
abc123     1416603040000    buy    73253

This is probably OK for point B above as well. It’s not too hard on your eyes to understand this data structure so we’ll settle for that. You might ask why we didn’t select some other delimiter, such as a pipe ‘|’ or a comma ‘,’. The answer is that our demo system is based on URLs and URLs can have pipes and commas in them making them difficult to split. Tabs will work better but you are free to choose whatever fits your system best.

Implementation

This time we’ll hide the implementation of the storage mechanism behind an interface. It will be a forward-looking solution where we’ll be able to easily switch between the concrete implementations. Open the demo C# application we’ve been working on so far and locate the WebTransaction object in the AmazonKinesisConsumer application. We’ll add a method to create a tab-delimited string out of its properties:

public string ToTabDelimitedString()
{
	StringBuilder sb = new StringBuilder();
	sb.Append(CustomerName)
		.Append("\t")
		.Append(Url)
		.Append("\t")
		.Append(WebMethod)
		.Append("\t")
		.Append(ResponseTimeMs)
		.Append("\t")
		.Append(UtcDateUnixMs);
	return sb.ToString();
}
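
Reading such a line back is simply a matter of splitting on the tab character. The following helper is purely illustrative and not part of the demo app:

public static WebTransaction FromTabDelimitedString(string line)
{
	string[] fields = line.Split('\t');
	WebTransaction webTransaction = new WebTransaction();
	webTransaction.CustomerName = fields[0];
	webTransaction.Url = fields[1];
	webTransaction.WebMethod = fields[2];
	webTransaction.ResponseTimeMs = Convert.ToInt32(fields[3]);
	webTransaction.UtcDateUnixMs = Convert.ToInt64(fields[4]);
	return webTransaction;
}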

Create a text file on your hard drive, like c:\raw-data\storage.txt. Add the following interface to AmazonKinesisConsumer:

public interface IRawDataStorage
{
	void Save(IEnumerable<WebTransaction> webTransactions);
}

…and also the following file based implementation:

public class FileBasedDataStorage : IRawDataStorage
{
	private readonly FileInfo _fileName;

	public FileBasedDataStorage(string fileFullPath)
	{
		if (string.IsNullOrEmpty(fileFullPath)) throw new ArgumentNullException("File full path");
		_fileName = new FileInfo(fileFullPath);
		if (!_fileName.Exists)
		{
			throw new ArgumentException(string.Concat("Provided file path ", fileFullPath, " does not exist."));
		}			
	}
		
	public void Save(IEnumerable<WebTransaction> webTransactions)
	{
		StringBuilder stringBuilder = new StringBuilder();
		foreach (WebTransaction wt in webTransactions)
		{
			stringBuilder.Append(wt.ToTabDelimitedString()).Append(Environment.NewLine);
		}

		using (StreamWriter sw = File.AppendText(_fileName.FullName))
		{
			sw.Write(stringBuilder.ToString());
		}
	}
}

The implementation of the Save method should be quite straightforward. We build a string with the tab delimited representation of the WebTransaction object which is then appended to the source file.
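
The interface also makes it easy to swap in a different store later on, which is exactly where S3 will come in. As a purely illustrative example, a throwaway in-memory implementation for testing could look like this:

public class InMemoryDataStorage : IRawDataStorage
{
	private readonly List<WebTransaction> _store = new List<WebTransaction>();

	public void Save(IEnumerable<WebTransaction> webTransactions)
	{
		_store.AddRange(webTransactions);
		Console.WriteLine("Holding {0} web transaction(s) in memory.", _store.Count);
	}
}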

Here comes the updated ReadFromStream() method:

private static void ReadFromStream()
{
	IRawDataStorage rawDataStorage = new FileBasedDataStorage(@"c:\raw-data\storage.txt");
	AmazonKinesisConfig config = new AmazonKinesisConfig();
	config.RegionEndpoint = Amazon.RegionEndpoint.EUWest1;
	AmazonKinesisClient kinesisClient = new AmazonKinesisClient(config);
	String kinesisStreamName = ConfigurationManager.AppSettings["KinesisStreamName"];

	DescribeStreamRequest describeRequest = new DescribeStreamRequest();
	describeRequest.StreamName = kinesisStreamName;

	DescribeStreamResponse describeResponse = kinesisClient.DescribeStream(describeRequest);
	List<Shard> shards = describeResponse.StreamDescription.Shards;

	foreach (Shard shard in shards)
	{
		GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest();
		iteratorRequest.StreamName = kinesisStreamName;
		iteratorRequest.ShardId = shard.ShardId;
		iteratorRequest.ShardIteratorType = ShardIteratorType.TRIM_HORIZON;

		GetShardIteratorResponse iteratorResponse = kinesisClient.GetShardIterator(iteratorRequest);
		string iteratorId = iteratorResponse.ShardIterator;

		while (!string.IsNullOrEmpty(iteratorId))
		{
			GetRecordsRequest getRequest = new GetRecordsRequest();
			getRequest.Limit = 1000;
			getRequest.ShardIterator = iteratorId;

			GetRecordsResponse getResponse = kinesisClient.GetRecords(getRequest);
			string nextIterator = getResponse.NextShardIterator;
			List<Record> records = getResponse.Records;

			if (records.Count > 0)
			{
				Console.WriteLine("Received {0} records. ", records.Count);
				List<WebTransaction> newWebTransactions = new List<WebTransaction>();
				foreach (Record record in records)
				{
					string json = Encoding.UTF8.GetString(record.Data.ToArray());
					try
					{
						JToken token = JContainer.Parse(json);
						try
						{									
							WebTransaction wt = JsonConvert.DeserializeObject<WebTransaction>(json);
							List<string> validationErrors = wt.Validate();
							if (!validationErrors.Any())
							{
								Console.WriteLine("Valid entity: {0}", json);
								newWebTransactions.Add(wt);
							}
							else
							{
								StringBuilder exceptionBuilder = new StringBuilder();
								exceptionBuilder.Append("Invalid WebTransaction object from JSON: ")
									.Append(Environment.NewLine).Append(json)
									.Append(Environment.NewLine).Append("Validation errors: ")
									.Append(Environment.NewLine);
								foreach (string error in validationErrors)
								{
									exceptionBuilder.Append(error).Append(Environment.NewLine);																										
								}
								Console.WriteLine(exceptionBuilder.ToString());
							}									
						}
						catch (Exception ex)
						{
							//simulate logging
							Console.WriteLine("Could not parse the following message to a WebTransaction object: {0}", json);
						}
					}
					catch (Exception ex)
					{
						//simulate logging
						Console.WriteLine("Could not parse the following message, invalid json: {0}", json);
					}
				}

				if (newWebTransactions.Any())
				{
					try
					{
						rawDataStorage.Save(newWebTransactions);
						Console.WriteLine("Saved all new web transactions to the data store.");
					}
					catch (Exception ex)
					{
						Console.WriteLine("Failed to save the web transactions to file: {0}", ex.Message);
					}
				}
			}

			iteratorId = nextIterator;
		}
	}
}

Run both the consumer and producer applications and send a couple of web transactions to Kinesis. You should end up with the tab delimited observations in the storage file. In my case I have the following:

yahoo http://www.yahoo.com GET 432 1417556120657
google http://www.google.com POST 532 1417556133322
bbc http://www.bbc.co.uk GET 543 1417556148276
twitter http://www.twitter.com GET 623 1417556264008
wiki http://www.wikipedia.org POST 864 1417556302529
facebook http://www.facebook.com DELETE 820 1417556319381

This concludes our discussion of Amazon Kinesis. We’ve also set the path for the next series where we’ll be looking into Amazon S3. If you’re interested in a full Big Data chain using cloud-based Amazon components then you’re more than welcome to read on.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon Kinesis with the AWS.NET API Part 5: validation

Introduction

In the previous post we got as far as having a simple but functioning messaging system. The producer and client apps are both console-based and the message handler is the ready-to-use Amazon Kinesis. We have a system that we can build upon and scale up as the message load increases. Kinesis streams can be scaled to handle virtually unlimited amounts of messages.

This post on Kinesis will discuss message validation.

You’ll need to handle the incoming messages from the stream. Normally they should follow the specified format, such as JSON or XML with the predefined property names and casing. However, this is not always guaranteed as Kinesis does not itself validate any incoming message. Also, your system might be subject to fake data. So you’ll almost always need to have some message validation in place and log messages that cannot be processed or are somehow invalid.

Open the demo application we’ve been working on so far and let’s get to it.

Validation

We ended up with the following bit of code in AmazonKinesisConsumer:

if (records.Count > 0)
{
	Console.WriteLine("Received {0} records. ", records.Count);
	foreach (Record record in records)
	{
		string json = Encoding.UTF8.GetString(record.Data.ToArray());
		Console.WriteLine("Json string: " + json);
	}
}

We’ll build up the new code step by step and present the new version of the ReadFromStream() method at the end.

Our first task is to check if “json” is in fact valid JSON. There’s no dedicated method for that in JSON.NET so we’ll just see if the string can be parsed into a generic JToken:

string json = Encoding.UTF8.GetString(record.Data.ToArray());
try
{
        JToken token = JContainer.Parse(json);
}
catch (Exception ex)
{
        //simulate logging
	Console.WriteLine("Could not parse the following message, invalid json: {0}", json);
}

Normally every message that cannot be parsed should be logged and analysed. Here we just print the unparseable message to the console. If you’re interested in logging you can check out the posts on this blog here and here.

Next we want to parse the JSON into a WebTransaction object:

try
{
	JToken token = JContainer.Parse(json);
        try
	{
		WebTransaction wt = JsonConvert.DeserializeObject<WebTransaction>(json);
	}
	catch (Exception ex)
	{
		//simulate logging
		Console.WriteLine("Could not parse the following message to a WebTransaction object: {0}", json);
	}
}
catch (Exception ex)
{
	//simulate logging
	Console.WriteLine("Could not parse the following message, invalid json: {0}", json);
}

Next we can perform some validation on the object itself. We’ll make up some arbitrary rules:

  • The web method can only be one of the following: GET, POST, PUT, HEAD, DELETE, OPTIONS, TRACE, CONNECT
  • Acceptable range for response times: 0-30000 ms, probably not wide enough, but it’s OK for now
  • We only accept valid URLs using a validator function I’ve found here. It might not be perfect but at least we can filter out useless inputs like “this is spam” or “you’ve been hacked”

We’ll add the validation rules to WebTransaction.cs of the AmazonKinesisConsumer app:

public class WebTransaction
{
	private string[] _validMethods = { "get", "post", "put", "delete", "head", "options", "trace", "connect" };
	private int _minResponseTimeMs = 0;
	private int _maxResponseTimeMs = 30000;

        public long UtcDateUnixMs { get; set; }
	public string CustomerName { get; set; }
	public string Url { get; set; }
	public string WebMethod { get; set; }
	public int ResponseTimeMs { get; set; }

	public List<string> Validate()
	{
		List<string> brokenRules = new List<string>();
		if (!IsWebMethodValid())
		{
			brokenRules.Add(string.Format("Invalid web method: {0}", WebMethod));
		}
		if (!IsResponseTimeValid())
		{
			brokenRules.Add(string.Format("Response time outside acceptable limits: {0}", ResponseTimeMs));
		}
		if (!IsValidUrl())
		{
			brokenRules.Add(string.Format("Invalid URL: {0}", Url));
		}
		return brokenRules;
	}

	private bool IsWebMethodValid()
	{
		return _validMethods.Contains(WebMethod.ToLower());
	}

	private bool IsResponseTimeValid()
	{
		if (ResponseTimeMs < _minResponseTimeMs
			|| ResponseTimeMs > _maxResponseTimeMs)
		{
			return false;
		}
        	return true;
	}

	private bool IsValidUrl()
	{
		Uri uri;
		string urlToValidate = Url;
		if (!urlToValidate.Contains(Uri.SchemeDelimiter)) urlToValidate = string.Concat(Uri.UriSchemeHttp, Uri.SchemeDelimiter, urlToValidate);
		if (Uri.TryCreate(urlToValidate, UriKind.RelativeOrAbsolute, out uri))
		{
			try
			{
				if (Dns.GetHostAddresses(uri.DnsSafeHost).Length > 0)
				{
					return true;
				}
			}
			catch
			{
				return false;
			}
		}

		return false; 
	}

}

The Validate method will collect all validation errors. IsWebMethodValid() and IsResponseTimeValid() should be quite straightforward. If you don’t understand the IsValidUrl function check out the StackOverflow link referred to above.
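
A quick standalone check shows how the broken rules surface. The values below are deliberately invalid and made up:

WebTransaction invalid = new WebTransaction();
invalid.CustomerName = "test-customer";
invalid.Url = "not-a-real-url";   //fails the DNS lookup in IsValidUrl
invalid.WebMethod = "FETCH";      //not among the accepted web methods
invalid.ResponseTimeMs = -5;      //outside the accepted range
foreach (string error in invalid.Validate())
{
	Console.WriteLine(error);
}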

We can use the Validate method from within the ReadFromStream() method as follows:

List<WebTransaction> newWebTransactions = new List<WebTransaction>();
foreach (Record record in records)
{
	string json = Encoding.UTF8.GetString(record.Data.ToArray());
	try
	{
        	JToken token = JContainer.Parse(json);
		try
		{									
			WebTransaction wt = JsonConvert.DeserializeObject<WebTransaction>(json);
          		List<string> validationErrors = wt.Validate();
			if (!validationErrors.Any())
			{
				Console.WriteLine("Valid entity: {0}", json);
				newWebTransactions.Add(wt);
			}
			else
			{
				StringBuilder exceptionBuilder = new StringBuilder();
				exceptionBuilder.Append("Invalid WebTransaction object from JSON: ")
				.Append(Environment.NewLine).Append(json)
				.Append(Environment.NewLine).Append("Validation errors: ")
				.Append(Environment.NewLine);
				foreach (string error in validationErrors)
				{
					exceptionBuilder.Append(error).Append(Environment.NewLine);																										
				}
				Console.WriteLine(exceptionBuilder.ToString());
			}									
		}
        	catch (Exception ex)
		{
			//simulate logging
			Console.WriteLine("Could not parse the following message to a WebTransaction object: {0}", json);
		}
	}
	catch (Exception ex)
	{
		//simulate logging
		Console.WriteLine("Could not parse the following message, invalid json: {0}", json);
	}
}

As you can see we’re also collecting all valid WebTransaction objects into a list. That’s a preparation for the next post where we’ll store the valid objects on disk.

Here’s the current version of the ReadFromStream method:

private static void ReadFromStream()
{
	AmazonKinesisConfig config = new AmazonKinesisConfig();
	config.RegionEndpoint = Amazon.RegionEndpoint.EUWest1;
	AmazonKinesisClient kinesisClient = new AmazonKinesisClient(config);
	String kinesisStreamName = ConfigurationManager.AppSettings["KinesisStreamName"];

	DescribeStreamRequest describeRequest = new DescribeStreamRequest();
	describeRequest.StreamName = kinesisStreamName;

	DescribeStreamResponse describeResponse = kinesisClient.DescribeStream(describeRequest);
	List<Shard> shards = describeResponse.StreamDescription.Shards;

	foreach (Shard shard in shards)
	{
		GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest();
		iteratorRequest.StreamName = kinesisStreamName;
		iteratorRequest.ShardId = shard.ShardId;
		iteratorRequest.ShardIteratorType = ShardIteratorType.TRIM_HORIZON;

		GetShardIteratorResponse iteratorResponse = kinesisClient.GetShardIterator(iteratorRequest);
		string iteratorId = iteratorResponse.ShardIterator;

		while (!string.IsNullOrEmpty(iteratorId))
		{
			GetRecordsRequest getRequest = new GetRecordsRequest();
			getRequest.Limit = 1000;
			getRequest.ShardIterator = iteratorId;

			GetRecordsResponse getResponse = kinesisClient.GetRecords(getRequest);
			string nextIterator = getResponse.NextShardIterator;
			List<Record> records = getResponse.Records;

			if (records.Count > 0)
			{
				Console.WriteLine("Received {0} records. ", records.Count);
				List<WebTransaction> newWebTransactions = new List<WebTransaction>();
				foreach (Record record in records)
				{
					string json = Encoding.UTF8.GetString(record.Data.ToArray());
					try
					{
						JToken token = JContainer.Parse(json);
						try
						{									
							WebTransaction wt = JsonConvert.DeserializeObject<WebTransaction>(json);
							List<string> validationErrors = wt.Validate();
							if (!validationErrors.Any())
							{
								Console.WriteLine("Valid entity: {0}", json);
								newWebTransactions.Add(wt);
							}
							else
							{
								StringBuilder exceptionBuilder = new StringBuilder();
								exceptionBuilder.Append("Invalid WebTransaction object from JSON: ")
									.Append(Environment.NewLine).Append(json)
									.Append(Environment.NewLine).Append("Validation errors: ")
									.Append(Environment.NewLine);
								foreach (string error in validationErrors)
								{
									exceptionBuilder.Append(error).Append(Environment.NewLine);																										
								}
								Console.WriteLine(exceptionBuilder.ToString());
							}									
						}
						catch (Exception ex)
						{
							//simulate logging
							Console.WriteLine("Could not parse the following message to a WebTransaction object: {0}", json);
						}
					}
					catch (Exception ex)
					{
						//simulate logging
						Console.WriteLine("Could not parse the following message, invalid json: {0}", json);
					}
				}
			}

			iteratorId = nextIterator;
		}
	}
}

Run the application with F5. This will start the project that is set as the start-up project. You can start the other one using the technique we saw in the previous post: right-click, Debug, Start new instance. You’ll have two console windows running. If you had some messages left in the Kinesis stream then they should be validated now. I can see the following output:

Initial validation messages for Kinesis

Let’s now send some new messages to Kinesis:

Validation errors from messages to Kinesis

Great, we have some basic validation logic in place.

We’ll discuss storing the messages in the next post which will finish the series on Amazon Kinesis.

View all posts related to Amazon Web Services and Big Data here.

Big Data: using Amazon Kinesis with the AWS.NET API Part 4: reading from the stream

Introduction

In the previous post of this series on Amazon Kinesis we looked at how to publish messages to a Kinesis stream. In this post we’ll see how to extract them. We’ll create a Kinesis Client application.

It’s necessary to extract the messages from the stream as it only stores them for 24 hours. Also, a client application can filter, sort and validate the incoming messages according to some pre-defined rules.

Our demo client will be a completely separate application. We’ll see some duplication of code but that has a good reason. We’ll want to simulate a scenario where the producers are completely different applications, such as a bit of JavaScript on a web page, a Java web service, an iOS app or some other smart device. Our Kinesis producer is good for demo purposes but in reality the producer can be any software that can send HTTP requests. However, if both your producer and client apps are of the same platform then of course go ahead and introduce a common layer in the project.

Open the demo app we’ve been working on and let’s get to it.

The Kinesis client

Add a new C# console application called AmazonKinesisConsumer. Add the same NuGet packages as before:

AWS SDK NuGet package

Json.NET NuGet package

Add a reference to the System.Configuration library now as well. Also, add the same configuration to app.config:

<appSettings>
        <add key="AWSProfileName" value="demo-aws-profile"/>
	<add key="KinesisStreamName" value="test-stream"/>
</appSettings>

Insert the same WebTransaction object again:

public class WebTransaction
{
	public long UtcDateUnixMs { get; set; }
	public string CustomerName { get; set; }
	public string Url { get; set; }
	public string WebMethod { get; set; }
	public int ResponseTimeMs { get; set; }
}

We’ll make it easy for us here and re-use the same WebTransaction object as we know that we’ll be able to parse the incoming JSON string. However, as mentioned in the first post of this series, be prepared for different message formats and property names. If you can, always aim for some well accepted standard such as JSON or XML, they are easy to handle in code. E.g. if the incoming JSON has different names – including variations in casing – then you can use the JSON library to match the property names:

public class WebTransaction
{
	[JsonProperty(PropertyName="dateUtc")]
	public long UtcDateUnixMs { get; set; }
	[JsonProperty(PropertyName = "cust")]
	public string CustomerName { get; set; }
	[JsonProperty(PropertyName = "url")]
	public string Url { get; set; }
	[JsonProperty(PropertyName = "method")]
	public string WebMethod { get; set; }
	[JsonProperty(PropertyName = "responseTime")]
	public int ResponseTimeMs { get; set; }
}

In any case you can assume that the messages will come in as strings – or bytes that can be converted to strings to be exact.

Do not assume anything about the ordering of the messages. Messages in Kinesis are handled in parallel and they will be extracted in batches by a Kinesis client. So for best performance and consistency aim for short, independent and self-contained messages. If ordering matters or if the total message is too large for Kinesis then you can send extra properties with the messages such as “Index” and “Total” to indicate the order like “1 of 10”, “2 of 10” etc. so that the client can collect and sort them.
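
As an illustration only (these property names are made up and not part of the demo), a chunked message could carry its position like this:

public class ChunkedMessage
{
	//hypothetical properties: identify the original message and this chunk's place in it
	public string MessageGroupId { get; set; }
	public int Index { get; set; }  //e.g. 1
	public int Total { get; set; }  //e.g. 10, i.e. "1 of 10"
	public string Payload { get; set; }
}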

The shard iterator

Insert the following private method to Program.cs:

private static void ReadFromStream()
{
	AmazonKinesisConfig config = new AmazonKinesisConfig();
	config.RegionEndpoint = Amazon.RegionEndpoint.EUWest1;
	AmazonKinesisClient kinesisClient = new AmazonKinesisClient(config);
	String kinesisStreamName = ConfigurationManager.AppSettings["KinesisStreamName"];

	DescribeStreamRequest describeRequest = new DescribeStreamRequest();
	describeRequest.StreamName = kinesisStreamName;

	DescribeStreamResponse describeResponse = kinesisClient.DescribeStream(describeRequest);
	List<Shard> shards = describeResponse.StreamDescription.Shards;

	foreach (Shard shard in shards)
	{
		GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest();
		iteratorRequest.StreamName = kinesisStreamName;
		iteratorRequest.ShardId = shard.ShardId;
		iteratorRequest.ShardIteratorType = ShardIteratorType.TRIM_HORIZON;

		GetShardIteratorResponse iteratorResponse = kinesisClient.GetShardIterator(iteratorRequest);
		string iteratorId = iteratorResponse.ShardIterator;

		while (!string.IsNullOrEmpty(iteratorId))
		{
			GetRecordsRequest getRequest = new GetRecordsRequest();
			getRequest.Limit = 1000;
			getRequest.ShardIterator = iteratorId;

			GetRecordsResponse getResponse = kinesisClient.GetRecords(getRequest);
			string nextIterator = getResponse.NextShardIterator;
			List<Record> records = getResponse.Records;

			if (records.Count > 0)
			{
				Console.WriteLine("Received {0} records. ", records.Count);
				foreach (Record record in records)
				{
					string json = Encoding.UTF8.GetString(record.Data.ToArray());
					Console.WriteLine("Json string: " + json);
				}
			}
			iteratorId = nextIterator;
		}
	}
}

Let’s see what’s going on here. The first 4 lines are identical to what we had in the Kinesis producer: we simply configure the access to Kinesis. We use the Kinesis client object to describe the Kinesis stream referred to by its name in the DescribeStreamRequest object. We then extract the available shards in the stream.

We then iterate through the shards. For each shard – we have only one – we need to request a shard iterator. A shard iterator will help us iterate through the messages in the shard. We specify where we want to start using the ShardIteratorType enumeration. TRIM_HORIZON means that we want to start with the oldest message first and work our way up from there. This is like a first-in-first-out collection and is probably the most common way to extract the messages. Other enumeration values are the following:

  • AT_SEQUENCE_NUMBER: read from the position indicated by a sequence number
  • AFTER_SEQUENCE_NUMBER: start right after the sequence number
  • LATEST: always read the most recent data in the shard

If you recall from the previous post a sequence number is an ID attached to each message.
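
If you want to resume reading from a known position instead, the shard iterator request also accepts a starting sequence number. A sketch within the same loop; the sequence number value below is of course made up:

GetShardIteratorRequest resumeRequest = new GetShardIteratorRequest();
resumeRequest.StreamName = kinesisStreamName;
resumeRequest.ShardId = shard.ShardId;
resumeRequest.ShardIteratorType = ShardIteratorType.AFTER_SEQUENCE_NUMBER;
resumeRequest.StartingSequenceNumber = "49541296383533603670305612607160966548683674396982771714"; //hypothetical value
GetShardIteratorResponse resumeResponse = kinesisClient.GetShardIterator(resumeRequest);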

Once we get the iterator we extract its ID which is used in the GetRecordsRequest object. Note that we enter a while loop and check if the iterator ID is null or empty. The GetRecordsResponse will also include an iterator ID which is a handle to read any subsequent messages. This will normally be an endless loop allowing us to always listen to messages from the stream. If there are any records returned by the iterator we print the number of records and the pure string data of each record. We expect to see some JSON messages. We don’t yet parse them to our WebTransaction messages, we’ll continue with processing the raw data in the next post.

Call this method from Main:

static void Main(string[] args)
{
	ReadFromStream();

	Console.WriteLine("Main done...");
	Console.ReadKey();
}

Test

Let’s see this in action. Make AmazonKinesisConsumer the start-up project of the solution and start the application. If you completed the previous post of this series within the last 24 hours then you should see the messages you sent to the Kinesis stream before – recall that Kinesis keeps the messages for 24 hours. I can see the following JSON messages:

Messages extracted from Kinesis

Keep the application running. You’ll see that the loop just continues to run and the application doesn’t stop – we’re effectively waiting for new messages from the stream. Back in VS, right-click AmazonKinesisProducer, select Debug, Start new instance. You’ll have two console windows up and running:

Kinesis producer and client running in parallel

Enter a couple of new web transactions into the producer and send them to Kinesis. The client should fetch them in a couple of seconds:

New records extracted from Kinesis

Great, we now have a highly efficient cloud-based message handler in the form of Amazon Kinesis, a Kinesis client and a Kinesis producer. We’ve also seen that although the stream is located in the cloud, the producers and clients can run on virtually any platform that is able to handle HTTP messages. Therefore don’t get bogged down by the thought that you have to use Amazon components with Kinesis.

In the next post we’ll add some validation to the incoming messages.

View all posts related to Amazon Web Services and Big Data here.

Architecture of a Big Data messaging and aggregation system using Amazon Web Services part 5

Introduction

In the previous post we extended our Amazon Big Data design with Amazon RedShift. In this post we’ll look at some data storage options for the aggregated data. Also, we’ll look at an option on how to keep this flow running, i.e. how to make sure that aggregation happens at regular intervals automatically.

Storing aggregated data

Say that the aggregation mechanism has finished the aggregations that are interesting to your business. The aggregated data must also be stored somewhere that’s accessible to other systems. By “other systems” I mean applications where the end users can view the aggregations and run other queries, like a web site with lots of nice graphs. The input data for those graphs should be easily accessible from a data store.

For Elastic MapReduce you have 2 options at the time of writing this post. You can either export the aggregated data back to S3 or to DynamoDb. We’ve seen S3 before in this series but DynamoDb is new.

DynamoDb is Amazon’s cloud-based NoSql database. If you worked with databases like MongoDb or RavenDb before then you’ll see that DynamoDb is similar. There are no schemas, you can store any type of unstructured data in DynamoDb documents. I’d say that DynamoDb provides better structuring and the usual CRUD mechanisms are better supported than in S3. If you’d like to save your aggregated data from EMR somewhere else, like RedShift or Amazon Relational Database Service (RDS) then you’ll have to do it indirectly: save the data in S3 first and then export it to RedShift or RDS from S3 via some automation service like Amazon Import/Export or Amazon Data Pipeline – more on Data Pipeline below.

If you use RedShift as the aggregation mechanism then RedShift provides excellent PostgreSQL-based tabular storage as well so there’s probably no need to look any further. You can probably export the aggregation results to DynamoDb or S3 but I don’t really see the point. RedShift tables are easily accessible for a wide range of technologies: .NET, Java, Python etc.

Let’s extend our architecture diagram with the storage mechanisms:

Amazon Big Data Diagram with aggregation result storage

Automation

By automation I mean the automation of the aggregation jobs. You’ll probably want the aggregation job to run at defined intervals, say every 15 minutes. At the same time you might want to start an ad-hoc aggregation job outside the automation interval. One Amazon-based option is the following setup:

  • Build a “normal” Java application that drives the aggregation process by calling the aggregation mechanism – EMR or RedShift – to run one or more aggregation scripts
  • Compile the Java app into a JAR file
  • Save the JAR file in S3
  • Let the JAR file be executed by another Amazon service called Data Pipeline using a shell script (.sh) which is also stored in S3

AWS Data Pipeline (DP) is an automation tool that can run a variety of job types – or activities as they are called in DP. DP can execute jobs at intervals or just once, log the result, re-try failed jobs and much more. If you decide to try this solution then ShellCommandActivity is the activity type you’re looking for. I won’t provide any details on how to set up ShellCommandActivity here as this blog post is entirely dedicated to that.

Let’s add DP to our diagram:

Amazon Big Data Diagram with Amazon Data Pipeline

Extensions

We’re actually done with the core of our Big Data system. However, it can be extended in numerous ways, here are some examples:

  • Accessing Kinesis requires that you provide your security credentials in the Kinesis producer, i.e. the application that sends the raw data messages to Kinesis. E.g. if you’re collecting the response times from an HTML page then the underlying JS file will need to include your credentials, which makes your Amazon account very vulnerable. An option to alleviate the problem is to set up a public web page in front of Kinesis, like a web service. This service can then itself forward the message to Kinesis. Another option is to set up temporary credentials for the producers. This page describes how to do that with the AWS Security Token Service.
  • Amazon has an efficient in-memory caching solution called ElastiCache. The aggregation mechanism could potentially save the aggregated data in the data store and also push it to the cache. The consuming application will then first consult the cache instead of the database to ease the load

I have another blog series dedicated to Big Data with .NET where I go through some of these components in greater detail and a lot of code examples.

This post concludes this series. I hope you’ve learnt a lot of good stuff.

View all posts related to Amazon Web Services here.

Big Data: using Amazon Kinesis with the AWS.NET API Part 3: sending to the stream

Introduction

In the previous post of this series we set up the Kinesis stream, installed the .NET SDK and inserted a very simple domain object into a Kinesis producer console application.

In this post we’ll start posting to our Kinesis stream.

Open the AmazonKinesisProducer demo application and let’s get to it.

Preparations

We cannot just call the services within the AWS SDK without proper authentication. This is an important reference page to handle your credentials in a safe way. We’ll take the recommended approach and create a profile in the SDK Store and reference it from app.config.

This series is not about AWS authentication so we won’t go into temporary credentials, but later on you may be interested in that option too. Since we’re programmers and it takes a single line of code to set up a profile, we’ll go with the programmatic option. Add the following line to Main:

Amazon.Util.ProfileManager.RegisterProfile("demo-aws-profile", "your access key id", "your secret access key");

I suggest you remove the code from the application later on in case you want to distribute it. Run the application and it should execute without exceptions. Next open app.config and add the appSettings section with the following elements:

<appSettings>
        <add key="AWSProfileName" value="demo-aws-profile"/>
	<add key="KinesisStreamName" value="test-stream"/>
</appSettings>

Generating web transactions

We’ll create web transaction objects using the console. Add the following private methods to Program.cs:

private static List<WebTransaction> GetTransactions()
{
	List<WebTransaction> webTransactions = new List<WebTransaction>();
	Console.WriteLine("Enter your web transactions. ");
	Console.Write("URL - type 'x' and press Enter to exit: ");
	string url = Console.ReadLine();
	while (url != "x")
	{
		WebTransaction wt = new WebTransaction();
		wt.Url = url;
		wt.UtcDateUnixMs = ConvertToUnixMillis(DateTime.UtcNow);

		Console.Write("Customer name: ");
		string customerName = Console.ReadLine();
		wt.CustomerName = customerName;

		Console.Write("Response time (ms): ");
		int responseTime = Convert.ToInt32(Console.ReadLine());
		wt.ResponseTimeMs = responseTime;

		Console.Write("Web method: ");
		string method = Console.ReadLine();
		wt.WebMethod = method;

		webTransactions.Add(wt);

		Console.Write("URL - enter 'x' and press enter to exit: ");
		url = Console.ReadLine();
	}
	return webTransactions;
}

private static long ConvertToUnixMillis(DateTime dateToConvert)
{
	return Convert.ToInt64(dateToConvert.Subtract(new DateTime(1970,1,1,0,0,0,0)).TotalMilliseconds);
}

GetTransactions() is a simple input loop of the kind you must have written in your second or third C# course. Note that I haven’t added any validation, such as checking that the web method or the response time are sensible, so be gentle and enter “correct” values later on during the tests. ConvertToUnixMillis simply converts a date to a UNIX timestamp in milliseconds. DateTime in .NET 4.5 doesn’t natively support UNIX timestamps; built-in conversions only arrive with DateTimeOffset in .NET 4.6.
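
For reference, this is roughly what the built-in conversion looks like once you target .NET 4.6 or newer – a small sketch, not something the demo project needs:

//equivalent of ConvertToUnixMillis(DateTime.UtcNow) on .NET 4.6+
long unixMillis = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
//...and converting back again
DateTimeOffset restored = DateTimeOffset.FromUnixTimeMilliseconds(unixMillis);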

Sending the transactions to the stream

We’ll send each message one by one in the following method which you can add to Program.cs:

private static void SendWebTransactionsToQueue(List<WebTransaction> transactions)
{
	AmazonKinesisConfig config = new AmazonKinesisConfig();
	config.RegionEndpoint = Amazon.RegionEndpoint.EUWest1;
	AmazonKinesisClient kinesisClient = new AmazonKinesisClient(config);
	String kinesisStreamName = ConfigurationManager.AppSettings["KinesisStreamName"];

	foreach (WebTransaction wt in transactions)
	{
		string dataAsJson = JsonConvert.SerializeObject(wt);
		byte[] dataAsBytes = Encoding.UTF8.GetBytes(dataAsJson);
		using (MemoryStream memoryStream = new MemoryStream(dataAsBytes))
		{
			try
			{
				PutRecordRequest requestRecord = new PutRecordRequest();
				requestRecord.StreamName = kinesisStreamName;
				requestRecord.PartitionKey = "url-response-times";
				requestRecord.Data = memoryStream;

				PutRecordResponse responseRecord = kinesisClient.PutRecord(requestRecord);
				Console.WriteLine("Successfully sent record {0} to Kinesis. Sequence number: {1}", wt.Url, responseRecord.SequenceNumber);
			}
			catch (Exception ex)
			{
				Console.WriteLine("Failed to send record {0} to Kinesis. Exception: {1}", wt.Url, ex.Message);
			}
		}
	}
}

You’ll need to reference the System.Configuration library to make this work.
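
For completeness, these are the using directives the method relies on – the exact list may vary slightly with your SDK version:

using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Text;
using Amazon.Kinesis;
using Amazon.Kinesis.Model;
using Newtonsoft.Json;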

We first configure our access to Kinesis using the AmazonKinesisConfig object. We set the region to the one where we set up the stream. In my case it’s eu-west-1, but you may need to provide something else. We also read the stream name from app.config.
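
If you’d rather not hard-code the region either, it can be read from app.config as well. A small sketch, assuming a hypothetical AWSRegion key that you’d add yourself:

//assumes an extra app.config entry: <add key="AWSRegion" value="eu-west-1"/>
AmazonKinesisConfig config = new AmazonKinesisConfig();
config.RegionEndpoint = Amazon.RegionEndpoint.GetBySystemName(ConfigurationManager.AppSettings["AWSRegion"]);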

Then for each of the WebTransaction objects we go through the following process:

  • Get the JSON representation of the object
  • Convert the JSON to a byte array
  • Put the byte array into a MemoryStream
  • Set up the PutRecordRequest object with the stream name, the partition key and the data we want to publish
  • Send the record to Kinesis using the PutRecord method
  • If the call succeeds, print the sequence number of the message
  • Otherwise print the exception message

What is a partition key? It is a key to group the data within a stream into shards. And a sequence number? It is a unique ID that each message gets upon insertion into the stream. This page with the key concepts will be a good friend of yours while working with Kinesis.
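
All the records above share the same hard-coded partition key, so they are all grouped onto the same shard. A possible variation – purely illustrative – is to partition by URL instead, so that records for different pages can spread across shards once the stream has more than one:

PutRecordRequest requestRecord = new PutRecordRequest();
requestRecord.StreamName = kinesisStreamName;
//partition by URL instead of a fixed string so different pages can map to different shards
requestRecord.PartitionKey = wt.Url;
requestRecord.Data = memoryStream;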

Test

We can call these functions from Main as follows:

List<WebTransaction> webTransactions = GetTransactions();
SendWebTransactionsToQueue(webTransactions);

Console.WriteLine("Main done...");
Console.ReadKey();

Start the application and create a couple of WebTransaction objects using the console. Then if all goes well you should see a printout similar to the following in the console window:

Messages sent to Kinesis stream console output

Let’s see what the Kinesis dashboard is telling us:

PutRequest count on AWS Kinesis dashboard

The PutRequest graph climbed to 5 – and since I then put just one more message to the stream, the most recent data point dropped to 1:

PutRequest count on AWS Kinesis dashboard

In the next post we’ll see how to read the messages from the stream.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon Kinesis with the AWS.NET API Part 2: stream, .NET SDK and domain setup

Introduction

In the previous post we went through an introduction of Amazon Kinesis. We established that Kinesis is an ideal out-of-the-box starting point for your Big Data analysis needs. It takes a lot of burden off your shoulders regarding scaling, maintenance and redundancy. We also said that Kinesis only provides 24-hour storage of the messages, so we’ll need to build an application, a Kinesis Client, that ultimately processes the messages in some way: filtering, sorting, saving etc.

In this post we’ll create our Kinesis stream and install the AWS SDK.

Creating the stream

Log onto the AWS console and locate the Kinesis service:

Kinesis icon on AWS console

Probably every service you use with AWS has a region that you can select in the top right section of the UI:

Amazon region selector

These regions are significant for most services, with a couple of exceptions. E.g. S3, which we’ll discuss in the next series, is global and has less regional significance. In the case of Kinesis, when you create a new stream it will be available in the selected region. That doesn’t mean users cannot send messages to a stream in Ireland from Australia; it will simply take Australian users a bit more time to reach that stream than a user in the UK. Also, we’ll see later that the region must be specified in code when configuring access to AWS, otherwise you may be wondering why your stream cannot be located.

You can create a new stream with the Create Stream button:

Create stream button on Amazon Kinesis

Note that, at the time of writing this post, Kinesis has no free tier. According to the example in the current pricing table it costs about $4.22 a day to process 1000 messages per second where each message is 5KB in size. We will only test with a handful of individual messages in this series so the total cost should be minimal.

Enter a stream name and set the number of shards to 1; that will be enough for testing:

Creating a test stream in Kinesis

Press “Create” and you’ll be redirected to the original screen with the list of streams. Your new stream should be in “CREATING” status:

Kinesis stream in creating status

…which will shortly switch to “ACTIVE”.

You can click the name of the stream which will open a screen with a number of performance indicators:

Kinesis stream performance indicators

We haven’t processed any messages so there are no put or get requests to show yet.
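
As a side note, you don’t have to use the console for this step. Once the .NET SDK – which we’ll install in the next section – is referenced, a stream can also be created and monitored from code. A minimal sketch, assuming your credentials and region are already configured:

private static void CreateKinesisStream(string streamName)
{
	AmazonKinesisConfig config = new AmazonKinesisConfig();
	config.RegionEndpoint = Amazon.RegionEndpoint.EUWest1;
	AmazonKinesisClient kinesisClient = new AmazonKinesisClient(config);

	CreateStreamRequest createStreamRequest = new CreateStreamRequest();
	createStreamRequest.StreamName = streamName;
	createStreamRequest.ShardCount = 1;
	kinesisClient.CreateStream(createStreamRequest);

	//poll the stream until it leaves the CREATING status and becomes ACTIVE
	//Thread.Sleep requires the System.Threading namespace
	string status = "CREATING";
	while (status == "CREATING")
	{
		Thread.Sleep(5000);
		DescribeStreamRequest describeStreamRequest = new DescribeStreamRequest();
		describeStreamRequest.StreamName = streamName;
		DescribeStreamResponse describeStreamResponse = kinesisClient.DescribeStream(describeStreamRequest);
		status = describeStreamResponse.StreamDescription.StreamStatus.ToString();
	}
	Console.WriteLine("Stream {0} is now {1}", streamName, status);
}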

That’s it, we have a functioning Kinesis stream up and running. Let’s move on.

Installing the SDK

The Amazon .NET SDK is available through NuGet. Open Visual Studio 2012/2013 and create a new C# console application called AmazonKinesisProducer. The purpose of this application will be to send messages to the stream. In reality the message producer could be any type of application:

  • A website
  • A Windows/Android/iOS app
  • A Windows service
  • A traditional desktop app

…i.e. any application that’s capable of sending HTTP/S PUT requests to a service endpoint. We’ll keep it simple and not waste time with view-related tasks.

Install the following NuGet package:

AWS SDK NuGet package

We’ll also be working with JSON data so let’s install the popular Newtonsoft.Json package as well:

Json.NET NuGet package

Domain

In this section we’ll set up the data structure of the messages we’ll be processing. I’ll reuse a simplified version of the messages we had in a real-life project similar to what we’re going through. We’ll pretend that we’re measuring the total response time of web pages that our customers visit.

A real-life solution would involve JavaScript embedded into the HTML of certain pages. That JavaScript collects data points like “transaction start” and “transaction finish”, which make it possible to measure the response time of a web page as it’s experienced by a real end user. The JavaScript then sends the transaction data to a web service as JSON.

In our case of course we’ll not go through all that. We’ll pre-produce our data points using a C# object and JSON.

Insert the following class into the Kinesis producer app:

public class WebTransaction
{
	public long UtcDateUnixMs { get; set; }
	public string CustomerName { get; set; }
	public string Url { get; set; }
	public string WebMethod { get; set; }
	public int ResponseTimeMs { get; set; }
}

Dates are easiest to handle as UNIX timestamps in milliseconds since most systems can work with them. DateTime in .NET 4.5 doesn’t have any built-in support for UNIX timestamps but that’s easy to solve. Formatted date strings are more difficult to parse so we won’t go down that route. The purpose of the other properties should be self-explanatory.
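
To make the message shape concrete, this is roughly what a serialised WebTransaction looks like with Json.NET’s default settings – the values are of course made up:

WebTransaction sample = new WebTransaction();
sample.UtcDateUnixMs = 1417871145000;
sample.CustomerName = "Favourite customer";
sample.Url = "http://www.mysite.com/home";
sample.WebMethod = "GET";
sample.ResponseTimeMs = 354;

//JsonConvert.SerializeObject(sample) produces something like:
//{"UtcDateUnixMs":1417871145000,"CustomerName":"Favourite customer","Url":"http://www.mysite.com/home","WebMethod":"GET","ResponseTimeMs":354}
Console.WriteLine(JsonConvert.SerializeObject(sample));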

We’ll start sending messages to our stream in the next post.

View all posts related to Amazon Web Services and Big Data here.

Architecture of a Big Data messaging and aggregation system using Amazon Web Services part 4

Introduction

In the previous post we looked at two components in our architecture:

  • Elastic Beanstalk to host the Java Kinesis Client Application
  • Elastic MapReduce, a Hadoop-based Amazon component as an alternative data aggregation platform

In this post we’ll look at another AWS component that is very well suited to data aggregation jobs: RedShift.

Amazon RedShift

Amazon RedShift is Amazon’s data warehousing solution. It follows a columnar DBMS architecture and it was designed especially for heavy data mining requests.

RedShift is based on PostgreSQL with some Amazon-specific additions, e.g. for importing raw data values from an S3 bucket. PostgreSQL syntax is very similar to other SQL dialects you may be familiar with, such as MS SQL. Therefore learning the basics of the language doesn’t require a lot of time and effort and you can become productive quite fast.

Having said all that, here comes a warning – a serious one. The version of PostgreSQL employed on RedShift has some serious limitations compared to full-blown PostgreSQL. E.g. stored procedures, triggers, functions, auto-incrementing primary keys, enforced foreign keys etc. are NOT SUPPORTED. RedShift is still optimised for aggregation functions but it’s a good idea to be aware of the limits. This page has links to the lists of all missing or limited features.
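
To give a taste of the S3 import mentioned above: raw JSON files sitting in a bucket can be loaded into a RedShift table with the COPY command. Below is a minimal sketch of issuing that command from C# over the Npgsql driver – the cluster endpoint, table, bucket and credentials are all made up, and the exact driver setup may differ in your environment:

//requires the Npgsql NuGet package and "using Npgsql;"
//RedShift exposes a PostgreSQL-compatible endpoint, so a PostgreSQL client library can talk to it
public void ImportRawTransactionsFromS3()
{
	string connectionString = "Server=my-cluster.xxxxxxxx.eu-west-1.redshift.amazonaws.com;Port=5439;Database=rawdata;User Id=masteruser;Password=secret;";
	string copyCommand =
		"copy web_transactions " +
		"from 's3://raw-data-bucket/web-transactions/' " +
		"credentials 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET' " +
		"json 'auto';";

	using (NpgsqlConnection connection = new NpgsqlConnection(connectionString))
	{
		connection.Open();
		using (NpgsqlCommand command = new NpgsqlCommand(copyCommand, connection))
		{
			command.ExecuteNonQuery();
		}
	}
}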

RedShift vs. Elastic MapReduce

You may be asking which of the two aggregation mechanisms is faster, EMR with Hive or RedShift with PostgreSQL. According to the tests we performed in our project, RedShift aggregation jobs run faster. A lot faster. RedShift can also serve as the storage for the aggregated data: other platforms like web services or desktop apps can easily pull the data from a RedShift table. You can also store the aggregated data on EMR in the Hadoop file system, but that data is not as readily available to other external platforms. We’ll look at some more storage options in the next part of this series so I won’t give any more details here.

This doesn’t mean that EMR is completely out of the game, but if you’re facing a scenario such as the one described in this series then you’re probably better off using RedShift for the aggregation purposes.

DB schema for RedShift

Data warehousing requires a different mindset to what you might be accustomed to from DB-driven application development. In a “normal” database behind a web app you’ll probably have tables that follow your domain, like “Customer”, “Product” and “Order”. Then you’ll have foreign keys and intermediate tables to represent 1-to-M and M-to-M relationships. Also, you’ll probably keep your tables normalised.

That’s often simply not good enough for analytic and data mining applications. In data analysis apps your customers will be after some complex aggregation queries:

  • What was the maximum response time of /Products.aspx on my web page for users in Seattle using IE 10 on January 12 2015?
  • What was the average sales of product ‘ABC’ in our physical shops in the North-America region between 01 and 20 December 2014?
  • What is the total value of product XYZ sold on our web shop from iOS mobile apps in France in February 2015 after running a campaign?

You can replace “average” and “maximum” with any other aggregation type, such as the 95th percentile or the median. With so many possible combinations your aggregation scripts would need to run through a very long list of aggregations. Also, trying to save every thinkable aggregation combination in different tables would cause the number of tables to explode. Such a setup would require a lot of lookups and joins, which greatly reduce performance. And we haven’t even mentioned the difficulty of adding new aggregation types and storing historical data.

Data mining applications have adopted two schema types specially designed to solve this problem:

  • Star schema: a design with a Fact table in the middle and one or more Dimension tables around it. The dimension tables are often denormalised
  • Snowflake schema: very similar to a star schema but the dimension tables are normalised. The fact table is therefore surrounded by the dimension tables and their own broken-out sub-tables

I won’t even attempt to describe these schema types here as the post – and the series – would explode with material that’s out of scope. I just wanted to make you aware of these ideas. Here’s an example of each type from Wikipedia to give you a taste.

Star:

Star schema example

Snowflake:

Snowflake schema example

They are often used by analytic applications such as SQL Server Analysis Services. If you’re planning to take on data mining at a serious level then you’ll inevitably need to get acquainted with them.

Of course, if you’re only planning to support some basic aggregation types then such schema designs may be overkill. It all depends on your goals.

RedShift is very well suited for both Star and Snowflake schema types. There’s a long article that goes through Star and Snowflake schemas in RedShift available here.

Let’s add RedShift to our diagram as another alternative:

Amazon Big Data Diagram with RedShift

In the next post – which will finish up this series – we’ll look into potential storage mechanisms for both RedShift and EMR.

View all posts related to Amazon Web Services here.
