Using Amazon S3 with the AWS.NET API Part 6: S3 in Big Data II

Introduction

In the previous post we discussed how S3 can be incorporated as a storage mechanism in our overall Big Data infrastructure. We laid out the strategy for naming and storing raw data files and folders. In this post we’ll implement this strategy in the Amazon Kinesis client demo app which we worked on in this series on Amazon Kinesis.

Have that Kinesis client demo app open in Visual Studio and let’s get to work!

Preparation

Log onto S3 and create a top bucket where all raw data files will be saved. Let’s call the bucket “raw-urls-data”.

We’ll add a couple of new methods to WebTransaction.cs of AmazonKinesisConsumer that will be used during the S3 persistence process.

Insert this private method to convert a Unix timestamp to a .NET DateTime – a built-in conversion arrives with .NET 4.6, but for the time being we’ll make do with a workaround like this:

private DateTime UnixTimeStampToDateTime(long unixTimeMillis)
{
	DateTime epoch = new DateTime(1970, 1, 1, 0, 0, 0, 0, DateTimeKind.Utc);
	DateTime converted = epoch.AddMilliseconds(unixTimeMillis);
	return converted;
}
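
For reference, on newer framework versions – assuming .NET Framework 4.6 or later is available to you, which the demo project in this series doesn’t target – the same conversion is built in, so the helper above becomes a one-liner:

	//equivalent on .NET 4.6+; not used in this demo project
	DateTime converted = DateTimeOffset.FromUnixTimeMilliseconds(unixTimeMillis).UtcDateTime;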

Add the following new property for easy access to the DateTime equivalent of the data point observation date:

public DateTime ObservationDateUtc
{
	get
	{
		return UnixTimeStampToDateTime(UtcDateUnixMs);
	}
}

Update the ToTabDelimitedString() method to the following:

public string ToTabDelimitedString()
{
	StringBuilder sb = new StringBuilder();
	string delimiter = "\t";
	sb.Append(CustomerName)
		.Append(delimiter)
		.Append(Url)
		.Append(delimiter)
		.Append(WebMethod)
		.Append(delimiter)
		.Append(ResponseTimeMs)
		.Append(delimiter)
		.Append(UtcDateUnixMs)
		.Append(delimiter)
		.Append(ObservationDateUtc);
	return sb.ToString();
}

The above modification adds the human-readable UTC date to the stringified view of the raw data point.

Let’s also add one more method to the class that formats the observation date using the yyyy-MM-dd-HH-mm pattern. If you recall, the raw data file name will follow this date format, preceded by the customer name:

public string FormattedObservationDateMinutes()
{
	String dateFormat = "yyyy-MM-dd-HH-mm";			
	return ObservationDateUtc.ToString(dateFormat);
}

We’re done with the preparations.

Strategy implementation

In the Kinesis client app we had two console applications: AmazonKinesisConsumer and AmazonKinesisProducer. Insert a new class called AmazonS3DataStorage in AmazonKinesisConsumer with the following stub implementation:

public class AmazonS3DataStorage : IRawDataStorage
{
	public void Save(IEnumerable<WebTransaction> webTransactions)
	{
		throw new NotImplementedException();
	}
}

Let’s build up this class step by step. First we’ll need to inject the top bucket name so insert the following private variable and a constructor:

private readonly string _topBucketName;

public AmazonS3DataStorage(String topBucketName)
{
	if (String.IsNullOrEmpty(topBucketName)) throw new ArgumentNullException("topBucketName", "S3 bucket name cannot be empty!");
	_topBucketName = topBucketName;
}

We’ll need an object to contact S3. We’ve seen this method earlier in the S3 series, so let’s re-use it:

private IAmazonS3 GetAmazonS3Client()
{
	return Amazon.AWSClientFactory.CreateAmazonS3Client(RegionEndpoint.EUWest1);
}

We’ll group the web transactions by customer and date, where the date is represented at minute granularity. The key of the grouping – a dictionary – will follow the file name format customer-year-month-day-hour-minute. The value will be the list of web transactions the customer made in that minute. The method below creates that dictionary:

private Dictionary<string, List<WebTransaction>> GroupRecordsPerCustomerAndDate(IEnumerable<WebTransaction> allWebTransactions)
{
	Dictionary<string, List<WebTransaction>> group = new Dictionary<string, List<WebTransaction>>();
	foreach (WebTransaction wt in allWebTransactions)
	{
		String key = string.Concat(wt.CustomerName, "-", wt.FormattedObservationDateMinutes());
		if (group.ContainsKey(key))
		{
			List<WebTransaction> transactionsInGroup = group[key];
			transactionsInGroup.Add(wt);
		}
		else
		{
			List<WebTransaction> transactionsInGroup = new List<WebTransaction>();
			transactionsInGroup.Add(wt);
			group[key] = transactionsInGroup;
		}
	}

	return group;
}

It will be easier to follow what all these methods do later when we test the whole chain.

The following two private helper methods will build the folder name according to the folder name format we discussed in the previous post. Recall that these subfolders will represent intervals of 30 minutes:

private string BuildContainingFolderName(WebTransaction webTransaction)
{
	DateTime observationDate = webTransaction.ObservationDateUtc;
	int year = observationDate.Year;
	string monthString = FormatDateUnitWithLeadingZeroes(observationDate.Month);
	string dayString = FormatDateUnitWithLeadingZeroes(observationDate.Day);
	string hourString = FormatDateUnitWithLeadingZeroes(observationDate.Hour);
	int minuteInterval = GetMinuteInterval(observationDate.Minute);
	string minuteIntervalString = FormatDateUnitWithLeadingZeroes(minuteInterval);
	string folderNameDelimiter = "-";
	return string.Concat(minuteIntervalString, folderNameDelimiter, hourString, folderNameDelimiter
		, dayString, folderNameDelimiter, monthString, folderNameDelimiter, year);
}

private string FormatDateUnitWithLeadingZeroes(int dateUnit)
{
	String formatted = dateUnit < 10 ? string.Concat("0", dateUnit) : dateUnit.ToString();
	return formatted;
}

private int GetMinuteInterval(int minute)
{
	int res = 0;
	if (minute > 29)
	{
		res = 30;
	}
	return res;
}

The following helper method will build the tab delimited file content of a list of web transactions:

private String BuildRawDataFileContent(List<WebTransaction> webTransactions)
{
	StringBuilder recordBuilder = new StringBuilder();
	int size = webTransactions.Count;
	for (int i = 0; i < size; i++)
	{
		recordBuilder.Append(webTransactions[i].ToTabDelimitedString());
		if (i < size - 1)
		{
			recordBuilder.Append(Environment.NewLine);
		}
	}
	return recordBuilder.ToString();
}

…and this one will save the web transaction objects in an S3 folder:

private void SaveWebTransactionsInFolder(string folderName, string fileKey, List<WebTransaction> webTransactionsInFile)
{
	string fileContents = BuildRawDataFileContent(webTransactionsInFile);
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			PutObjectRequest putObjectRequest = new PutObjectRequest();
			putObjectRequest.ContentBody = fileContents;
			String delimiter = "/";
			putObjectRequest.BucketName = string.Concat(_topBucketName, delimiter, folderName);
			putObjectRequest.Key = fileKey;
			PutObjectResponse putObjectResponse = s3Client.PutObject(putObjectRequest);
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Failed to save the raw data observations in S3.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

There’s only one method missing before we can complete the Save method. This helper will check whether the required folder is present in the top bucket. If so, it calls the above method to save the raw data; otherwise it creates the folder first and then saves the raw data. Insert the following method:

private void SaveInS3(String fileKey, List<WebTransaction> webTransactions)
{
	if (!fileKey.EndsWith(".txt"))
	{
		fileKey += ".txt";
	}
	WebTransaction first = webTransactions.First();
	string containingFolder = BuildContainingFolderName(first);
	//check if folder exists
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			ListObjectsRequest findFolderRequest = new ListObjectsRequest();
			findFolderRequest.BucketName = _topBucketName;
			findFolderRequest.Delimiter = "/";
			findFolderRequest.Prefix = containingFolder;
			ListObjectsResponse findFolderResponse = s3Client.ListObjects(findFolderRequest);
			List<string> commonPrefixes = findFolderResponse.CommonPrefixes;
			if (commonPrefixes.Any())
			{
				SaveWebTransactionsInFolder(containingFolder, fileKey, webTransactions);
			}
			else //need to create the S3 folder first
			{
				PutObjectRequest folderRequest = new PutObjectRequest();						
				folderRequest.BucketName = _topBucketName;
				string delimiter = "/";
				String folderKey = string.Concat(containingFolder, delimiter);
				folderRequest.Key = folderKey;
				folderRequest.InputStream = new MemoryStream(new byte[0]);
				PutObjectResponse folderResponse = s3Client.PutObject(folderRequest);
				SaveWebTransactionsInFolder(containingFolder, fileKey, webTransactions);
			}
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Folder existence check or folder creation has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

We can now complete the body of the Save method:

public void Save(IEnumerable<WebTransaction> webTransactions)
{
	Dictionary<string, List<WebTransaction>> groups = GroupRecordsPerCustomerAndDate(webTransactions);
	foreach (var kvp in groups)
	{
		try
		{
			if (kvp.Value.Any())
			{
				SaveInS3(kvp.Key, kvp.Value);
			}
		}
		catch (Exception ex)
		{
			Console.WriteLine(string.Concat("Failed to write web transactions in ", kvp.Key, " to the S3 bucket"));
		}
	}
}

We group the observations and then save the raw data for each group.

Test

Currently we save the raw data to a local file using the FileBasedDataStorage implementation of IRawDataStorage, set up in the ReadFromStream method of Program.cs in AmazonKinesisConsumer. Replace it with the following:

IRawDataStorage rawDataStorage = new AmazonS3DataStorage("raw-urls-data");

Start AmazonKinesisProducer and send several URL observations to Kinesis. Make sure to send at least two for the same customer to check that they are really saved in the same file. I’m sending the following data points:

Test values to be sent for Kinesis connected with Amazon S3

Start the AmazonKinesisConsumer application and the data points should be caught and processed. It’s a good idea to put breakpoints here and there so you can follow exactly what’s happening. In my case the data points ended up in S3 as follows:

Test data points as seen in Amazon S3

The above image shows other test values as well, but the data points shown in the console screenshot all ended up in the 00-21-12-12-2014 folder, i.e. their observation dates all lie within the 21:00 – 21:30 interval of December 12, 2014. Let’s see the contents of that folder:

Specific data points in storage files in Amazon S3

You can see that the file naming convention follows what we planned: customer-year-month-day-hour-minute, in UTC. The interpretation of the top item, cnn-2014-12-12-21-18.txt, is that customer CNN made the following calls during the minute 21:18, i.e. from 21:18:00 to 21:18:59, on December 12, 2014. The actual calls are shown within the file as follows:

cnn http://www.cnn.com GET 135 1418419114682 2014-12-12 21:18:34
cnn http://www.cnn.com/africa GET 1764 1418419125964 2014-12-12 21:18:45

You can find both data points in the screenshot above.
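
As a quick sanity check, here’s a sketch of how the first record’s Unix timestamp maps to the file and folder names above. The values come from the console output shown earlier; the inline calculation simply mirrors the helpers we added to WebTransaction:

	//sketch only: reproduces the mapping for the first cnn record shown above
	long utcDateUnixMs = 1418419114682;
	DateTime epoch = new DateTime(1970, 1, 1, 0, 0, 0, 0, DateTimeKind.Utc);
	DateTime observationDateUtc = epoch.AddMilliseconds(utcDateUnixMs); //2014-12-12 21:18:34 UTC
	string fileName = string.Concat("cnn-", observationDateUtc.ToString("yyyy-MM-dd-HH-mm"), ".txt");
	//fileName is "cnn-2014-12-12-21-18.txt"; minute 18 is below 30, so the containing folder is 00-21-12-12-2014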

Current status

Currently we have the working skeleton project of the following Big Data system:

  • A message handling system that can process a very large number of messages per second, i.e. Amazon Kinesis
  • Code that can accept, validate and store the raw data points in a human and machine readable format
  • A durable and flexible storage mechanism in Amazon S3 where humans can find individual data points and from which external systems can extract data

Possible improvements

Apart from obvious software engineering considerations, such as SOLID, DRY, layering etc., we can introduce a couple of improvements to the code, e.g.:

  • The code checks in every loop iteration whether a specific folder exists in S3. This can be sped up by caching the names of the folders that have already been created and checking the cache first – see the sketch after this list. The save process must be as quick as possible and checking for the presence of a folder unnecessarily slows it down
  • Don’t assume that you can anticipate the order of the messages coming in from Kinesis. They are sent in batches and it is possible that two data points that should be saved in the same raw data file in S3 will arrive in two different batches. The second batch would then overwrite the file created for the first one, since a PutObject call with an existing key simply replaces the object. You could check for the existence of the file and extend its contents, but that really slows down the save process: there’s no built-in update mechanism in S3 – you have to download the contents of the file, append the new content to it and upload a replacement. Instead we can go with an extended naming convention for the raw data files and add a unique identifier to each batch from Kinesis, like a GUID or something similar. We’ll then go with the file name format customer-year-month-day-hour-minute-batchnumber.txt.
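
Here’s a possible sketch of the first improvement – an assumption on my part, not code from the series – in the form of a simple in-memory set of folder names inside AmazonS3DataStorage. SaveInS3 could call SaveWebTransactionsInFolder directly whenever IsFolderKnown returns true, and only fall back to the ListObjects check, followed by RememberFolder, for folders it hasn’t encountered yet:

	//hypothetical cache of folder names already seen or created by this consumer process
	private readonly HashSet<string> _knownFolders = new HashSet<string>();

	private bool IsFolderKnown(string containingFolder)
	{
		return _knownFolders.Contains(containingFolder);
	}

	private void RememberFolder(string containingFolder)
	{
		_knownFolders.Add(containingFolder);
	}

Note that such a cache only helps within a single consumer process; multiple consumers would each maintain their own copy.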

Next step

This post concludes our discussion of Amazon S3 in .NET. The next series will look at another storage mechanism offered by Amazon: DynamoDB, Amazon’s take on the NoSQL world. We’ll discuss this database solution and see how it can fit into a cloud-based Big Data architecture.

View all posts related to Amazon Web Services and Big Data here.


Using Amazon S3 with the AWS.NET API Part 5: S3 in Big Data

Introduction

In the previous post we looked at how to work with Amazon S3 folders in code.

This post will take up the Big Data thread where we left off at the end of the previous series on the scalable and powerful message handling service Amazon Kinesis. Therefore the pre-requisite of following the code examples in this post is familiarity with what we discussed there. However, I’ll try to write in a way that those of you who’ve only come here for S3 may get a taste of the role it can play in a cloud-based Big Data architecture. Who knows, you might just learn something useful.

I’ve split the discussion into 2 parts. In this post we’ll decide on our data storage strategy without writing any code. We’ll implement the strategy in the next post.

Reminder

Where were we? The last post of the Amazon Kinesis series left off where we stored our raw data in a text file on the local drive, like this:

yahoo http://www.yahoo.com GET 432 1417556120657
google http://www.google.com POST 532 1417556133322
bbc http://www.bbc.co.uk GET 543 1417556148276
twitter http://www.twitter.com GET 623 1417556264008
wiki http://www.wikipedia.org POST 864 1417556302529
facebook http://www.facebook.com DELETE 820 1417556319381

We hid the relevant function behind an interface:

public interface IRawDataStorage
{
	void Save(IEnumerable<WebTransaction> webTransactions);
}

Currently we have one implementation of the interface: FileBasedDataStorage. We’ll now use our new S3 skills and create a new implementation.

Strategy

We’ll give some structure to our S3 URL transaction data store. If we dump all observations into the same bucket then it will be difficult for both humans and software to search and aggregate the data.

Let’s recap what we discussed regarding the raw data and why we’re saving it in a tab-delimited format:

The format will most likely depend on the mechanism that will eventually pull data from the raw data store. Data mining and analysis solutions such as Amazon RedShift or Elastic MapReduce (EMR) – which we’ll take up later on – will all need to work with the raw data. So at this stage you’ll need to do some forward thinking:

  • A: What mechanism will need to read from the raw data store for aggregation?
  • B: How can we easily – or relatively easily – read the raw data visually by just opening a raw data file?

B is important for debugging purposes if you want to verify the calculations. It’s also important if some customer is interested in viewing the raw data for some time period. For B you might want to store the raw data as it is, i.e. as JSON. E.g. you can have a text file with the following data points:

{"CustomerId": "abc123", "DateUnixMs": 1416603010000, "Activity": "buy", "DurationMs": 43253}
{"CustomerId": "abc123", "DateUnixMs": 1416603020000, "Activity": "buy", "DurationMs": 53253}
{"CustomerId": "abc123", "DateUnixMs": 1416603030000, "Activity": "buy", "DurationMs": 63253}
{"CustomerId": "abc123", "DateUnixMs": 1416603040000, "Activity": "buy", "DurationMs": 73253}

…i.e. with one data point per line.

However, this format is not suitable for point A above. Other mechanisms will have a hard time understanding it. For RedShift and EMR to work most efficiently we’ll need to store the raw data in a delimited format such as CSV or tab-delimited fields. The above data points will then be stored as follows in a tab-delimited file:

abc123     1416603010000    buy    43253
abc123     1416603020000    buy    53253
abc123     1416603030000    buy    63253
abc123     1416603040000    buy    73253

This is probably OK for point B above as well. It’s not too hard on the eyes to understand this data structure, so we’ll settle for that. You might ask why we didn’t select some other delimiter, such as a pipe ‘|’ or a comma ‘,’. The answer is that our demo system is based on URLs, and URLs can contain pipes and commas, which makes them difficult to split. Tabs work better, but you are free to choose whatever fits your system best.
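
To make point A a bit more concrete, here’s a minimal sketch – the field names are illustrative, not taken from the series – of how an aggregation job could split one such tab-delimited line back into typed fields:

	//sketch: parsing one tab-delimited raw data line back into its fields
	string rawLine = "abc123\t1416603010000\tbuy\t43253";
	string[] fields = rawLine.Split('\t');
	string customerId = fields[0];
	long dateUnixMs = long.Parse(fields[1]);
	string activity = fields[2];
	int durationMs = int.Parse(fields[3]);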

File organisation refinements

Say we’d like to find out the average response time of yahoo.com over January 2014. If we put all raw data points in the same bucket with no folders then it will be difficult and time consuming to find the correct data points for data analysis. I took up this topic in another Amazon Big Data related series on this blog. I’ll copy the relevant considerations here with some modifications.

So the question now is how we actually organise our raw data files into buckets and folders. Again, we’ll need to consider points A and B above. In addition you’ll need to consider the frequency of your data aggregations: once a day, every hour, every quarter?

You might first go for a customer ID – or some other ID – based grouping, so e.g. you’ll have a top bucket and sub folders for each customer:

  • Top bucket: “raw-data-points”
  • Subfolder: “customer123”
  • Subfolder: “customer456”
  • Subfolder: “customer789”
  • …etc.

…and within each subfolder you can have subfolders based on dates in the raw data points, e.g.:

  • Sub-subfolder: “2014-10-11”
  • Sub-subfolder: “2014-10-12”
  • Sub-subfolder: “2014-10-13”
  • Sub-subfolder: “2014-10-14”
  • …etc.

That looks very nice and it probably satisfies question B above but not so much question A. This structure is difficult to handle for an aggregation mechanism as you’ll need to provide complex search criteria for the aggregation. In addition, suppose you want to aggregate the data every 30 minutes and you dump all raw data points into one of those sub-subfolders. Then again you’ll need to set up difficult search criteria for the aggregation mechanism to extract just the correct raw data points.

One possible solution is the following:

  • Decide on the minimum aggregation frequency you’d like to support in your system – let’s take 30 minutes for the sake of this discussion
  • Have one dedicated top bucket like “raw-data-points” above
  • Below this top bucket organise the data points into sub folders based on dates
  • There will be only one subfolder per period within the top bucket to make data access and searches easier
  • Each subfolder will contain a number of files which hold the raw data points in a tab delimited format

The names of the date sub-folders can be based on the minimum aggregation frequency. You’ll basically put the files into intervals where the date parts are reversed according to the following format:

minute-hour-day-month-year

Examples:

  • 00-13-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 13:00:00 until 13:29:59 inclusive
  • 30-13-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 13:30:00 until 13:59:59 inclusive
  • 00-14-15-11-2014: subfolder to hold the raw data for the interval 2014 November 15, 14:00:00 until 14:29:59 inclusive

…and so on. Each subfolder can then hold text files with the raw data points. In order to find a particular storage file of a customer you can do some pre-grouping in the Kinesis client application and not just save every data point one by one in S3: group the raw data points according to the customer ID and the date of the data point and save the raw files accordingly. You can then have the following text files in S3:

  • abc123-2014-11-15-13-32-43.txt
  • abc123-2014-11-15-13-32-44.txt
  • abc123-2014-11-15-13-32-45.txt

…where the names follow this format:

customerId-year-month-day-hour-minute-second

So within each file you’ll have the CSV or tab-delimited raw data that occurred in that given second. In case you want to go for a minute-based pre-grouping then you’ll end up with the following files:

  • abc123-2014-11-15-13-31.txt
  • abc123-2014-11-15-13-32.txt
  • abc123-2014-11-15-13-33.txt

…and so on. This is the same format as above but at the level of minutes instead.
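
Here’s a minimal sketch of the naming scheme described above, assuming UTC observation dates and a 30-minute minimum aggregation interval – we’ll write the production version of this logic in the next post:

	private static string BuildFolderName(DateTime utc)
	{
		//intervals of 30 minutes, date parts reversed: minute-hour-day-month-year
		int minuteInterval = utc.Minute < 30 ? 0 : 30;
		return string.Format("{0:00}-{1:00}-{2:00}-{3:00}-{4}", minuteInterval, utc.Hour, utc.Day, utc.Month, utc.Year);
	}

	private static string BuildFileName(string customerId, DateTime utc)
	{
		//pre-grouping at the level of minutes: customerId-year-month-day-hour-minute
		return string.Concat(customerId, "-", utc.ToString("yyyy-MM-dd-HH-mm"), ".txt");
	}

	//BuildFolderName for 2014-11-15 13:42:10 UTC returns "30-13-15-11-2014"
	//BuildFileName("abc123", same date) returns "abc123-2014-11-15-13-42.txt"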

Conclusion

Based on the above let’s go for the following strategy:

  • Minimum aggregation interval: 30 minutes
  • One top bucket
  • Subfolders will follow the rules outlined above, e.g. 00-13-15-11-2014
  • We’ll group the incoming data points by the minute: customerId-year-month-day-hour-minute

Example: if our top bucket in S3 is called “raw-data” then we’ll have the following file hierarchy:

  • raw-data bucket
  • sub-folder 00-13-15-11-2014
  • Within 00-13-15-11-2014 files like yahoo-2014-11-15-13-15.txt and facebook-2014-11-15-13-16.txt
  • Another subfolder within ‘raw-data’: 30-13-15-11-2014
  • Within 30-13-15-11-2014 files like yahoo-2014-11-15-13-42.txt and facebook-2014-11-15-13-43.txt

Keep in mind that all of the above can be customised based on your data structure. The main point is that S3 is an ideal way to store large amounts of raw data points within the Amazon infrastructure and that you’ll need to carefully think through how to organise your raw data point files so that they are easily handled by an aggregation mechanism.

We’ll implement this strategy in the next post.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon S3 with the AWS.NET API Part 4: working with folders in code

Introduction

In the previous post we looked at some more basic code examples to work with Amazon S3. In the first part we saw how to create folders within a bucket in the S3 GUI. We haven’t yet seen how to create and delete folders in code and that’s the goal of this post.

We’ll extend our demo application AmazonS3Demo so have it open in Visual Studio 2012/2013.

Creating folders

So far we’ve seen that it’s fairly simple and self-explanatory to work with objects in S3: PutObjectRequest, ListObjectsRequest etc. were easy to handle. However, there’s no dedicated C# object for folders in the SDK, such as a PutFolderRequest. PutObjectRequest is used to create folders as well, but it needs to be used in a special way.

Locate S3DemoService.cs in the demo app and add the following method:

public void RunFolderCreationDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{					
			PutObjectRequest folderRequest = new PutObjectRequest();					
			String delimiter = "/";
			folderRequest.BucketName = "a-second-bucket-test";
			String folderKey = string.Concat("this-is-a-subfolder", delimiter);
			folderRequest.Key = folderKey;
			folderRequest.InputStream = new MemoryStream(new byte[0]);
			PutObjectResponse folderResponse = s3Client.PutObject(folderRequest);
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Folder creation has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Most of the code probably looks familiar from the RunFileUploadDemo() method. The way to indicate that we intend to create a folder is to attach a “/” delimiter to the key name and set the content input stream to an empty byte array. Run this code from Main…:

static void Main(string[] args)
{
	S3DemoService demoService = new S3DemoService();
	demoService.RunFolderCreationDemo();

	Console.WriteLine("Main done...");
	Console.ReadKey();
}

…and the folder should be visible in S3:

Folder created in Amazon S3 bucket

What if you’d like to add another folder within this folder? It’s the same code as above, you just need to attach the name of the subfolder and the delimiter to the object key:

String folderKey = string.Concat("this-is-a-subfolder", delimiter, "another-subfolder",delimiter);
					folderRequest.Key = folderKey;

…and there it is:

Folder within a folder in Amazon S3 bucket

Checking if a folder exists

Say you’d like to find out whether the “this-is-a-subfolder/another-subfolder/” folder path exists. The following code snippet will first check if “this-is-a-subfolder” exists:

public void RunFolderExistenceCheckDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			ListObjectsRequest findFolderRequest = new ListObjectsRequest();
			findFolderRequest.BucketName = "a-second-bucket-test";
			findFolderRequest.Delimiter = "/";
			findFolderRequest.Prefix = "this-is-a-subfolder";
			ListObjectsResponse findFolderResponse = s3Client.ListObjects(findFolderRequest);
			List<String> commonPrefixes = findFolderResponse.CommonPrefixes;
			Boolean folderExists = commonPrefixes.Any();
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Folder existence check has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

This is similar to what we had in RunObjectListingDemo. However, we don’t check for the existence of file objects but of so-called prefixes. Folder names are part of the full path – the key – of an object, and those folder names appear as the prefixes in that path. If the CommonPrefixes string list is empty then the folder doesn’t exist.

Running this code will result in “true” for the folderExists variable.
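
If you’re curious what CommonPrefixes actually holds, you can print it right after the ListObjects call above. This is a small sketch; the expected value assumes the demo bucket and folder used in this post:

	foreach (string commonPrefix in commonPrefixes)
	{
		Console.WriteLine("Found folder prefix: {0}", commonPrefix);
	}
	//expected output for this demo: this-is-a-subfolder/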

Checking whether the subfolder within the folder exists requires just a minor change:

findFolderRequest.Prefix = "this-is-a-subfolder/another-subfolder";

Again, folderExists will be true.

Adding files to folders

Let’s add a file to the folders. The following code will insert c:\logfile.txt to “this-is-a-subfolder”:

public void RunObjectInsertionToFolderDemo()
{
	FileInfo filename = new FileInfo(@"c:\logfile.txt");
	string contents = File.ReadAllText(filename.FullName);
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			PutObjectRequest putObjectRequest = new PutObjectRequest();
			putObjectRequest.ContentBody = contents;
			String delimiter = "/";
			putObjectRequest.BucketName = string.Concat("a-second-bucket-test", delimiter, "this-is-a-subfolder"); 					
			putObjectRequest.Key = filename.Name;
			PutObjectResponse putObjectResponse = s3Client.PutObject(putObjectRequest);
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("File creation within folder has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

The only difference to the RunFileUploadDemo() method we saw before is that we need to extend the bucket name with the delimiter and the folder name, i.e. the prefix. Here’s the file in S3:

File within a folder in Amazon S3 bucket

You can probably guess how to upload the file to “this-is-a-subfolder/another-subfolder/”:

putObjectRequest.BucketName = string.Concat("a-second-bucket-test", delimiter, "this-is-a-subfolder", delimiter, "another-subfolder"); 					

…and here it is:

File within a folder within a folder in Amazon S3 bucket

List objects in a folder

Listing file objects within a folder is a bit tricky as even the folders themselves are objects so they will be returned by the following query:

ListObjectsRequest listObjectsRequest = new ListObjectsRequest();
String delimiter = "/";
listObjectsRequest.BucketName = "a-second-bucket-test";
listObjectsRequest.Prefix = string.Concat("this-is-a-subfolder", delimiter, "another-subfolder");				    
ListObjectsResponse listObjectsResponse = s3Client.ListObjects(listObjectsRequest);
foreach (S3Object entry in listObjectsResponse.S3Objects)
{
	if (entry.Size > 0)
	{
		Console.WriteLine("Found object with key {0}, size {1}, last modification date {2}", entry.Key, entry.Size, entry.LastModified);
	}
}					

The S3Objects list will hold 2 objects:

this-is-a-subfolder/another-subfolder/ of size 0
this-is-a-subfolder/another-subfolder/logfile.txt of size greater than 0

Hence we test for the size being greater than 0 to find “real” files. The output will be similar to the following:

Found object with key this-is-a-subfolder/another-subfolder/logfile.txt, size 4490, last modification date 2014-12-09 22:26:39

Deleting folders

The following method will attempt to delete the this-is-a-subfolder/another-subfolder/ folder:

public void RunFolderDeletionDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			DeleteObjectRequest deleteFolderRequest = new DeleteObjectRequest();
			deleteFolderRequest.BucketName = "a-second-bucket-test";
			String delimiter = "/";
			deleteFolderRequest.Key = string.Concat("this-is-a-subfolder", delimiter, "another-subfolder", delimiter);
			DeleteObjectResponse deleteObjectResponse = s3Client.DeleteObject(deleteFolderRequest);					
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Folder deletion has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}							

If you run this code then seemingly everything goes fine and no exceptions are thrown. However, the folder is still visible in the UI. The reason is that there’s currently a file within the folder, and folders cannot be deleted while they still contain objects. So in order for the above code to succeed we first need to check whether the folder has any objects in it using the ListObjectsRequest snippet above, delete all of those objects, and only then delete the folder itself.
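
Here’s one possible way to put that together – a sketch, not code from the original post, reusing the bucket and folder names of this demo. It lists everything under the folder prefix, which includes the zero-byte folder object itself, and deletes each object:

	public void RunFolderCleanupDemo()
	{
		using (IAmazonS3 s3Client = GetAmazonS3Client())
		{
			try
			{
				String delimiter = "/";
				String folderKey = string.Concat("this-is-a-subfolder", delimiter, "another-subfolder", delimiter);
				ListObjectsRequest listObjectsRequest = new ListObjectsRequest();
				listObjectsRequest.BucketName = "a-second-bucket-test";
				listObjectsRequest.Prefix = folderKey;
				ListObjectsResponse listObjectsResponse = s3Client.ListObjects(listObjectsRequest);
				foreach (S3Object entry in listObjectsResponse.S3Objects)
				{
					//delete every object under the prefix, including the zero-byte folder object
					DeleteObjectRequest deleteObjectRequest = new DeleteObjectRequest();
					deleteObjectRequest.BucketName = "a-second-bucket-test";
					deleteObjectRequest.Key = entry.Key;
					s3Client.DeleteObject(deleteObjectRequest);
				}
			}
			catch (AmazonS3Exception e)
			{
				Console.WriteLine("Folder cleanup has failed.");
				Console.WriteLine("Amazon error code: {0}",
					string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
				Console.WriteLine("Exception message: {0}", e.Message);
			}
		}
	}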

In the next post we’ll connect all this to our previous demo on Amazon Kinesis where we want to save the incoming URL response time observations in S3 in an organised manner.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon S3 with the AWS.NET API Part 3: code basics cont’d

Introduction

In the previous post we looked at some basic code examples for Amazon S3: list all buckets, create a new bucket and upload a file to a bucket.

In this post we’ll continue with some more code examples: downloading a resource, deleting it and listing the available objects.

We’ll extend the AmazonS3Demo C# console application with reading, listing and deleting objects.

Listing files in a bucket

The following method in S3DemoService.cs will list all objects within a bucket:

public void RunObjectListingDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			ListObjectsRequest listObjectsRequest = new ListObjectsRequest();
			listObjectsRequest.BucketName = "a-second-bucket-test";
			ListObjectsResponse listObjectsResponse = s3Client.ListObjects(listObjectsRequest);
			foreach (S3Object entry in listObjectsResponse.S3Objects)
			{
				Console.WriteLine("Found object with key {0}, size {1}, last modification date {2}", entry.Key, entry.Size, entry.LastModified);
			}
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Object listing has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

We use a ListObjectsRequest object to retrieve all objects from a bucket by providing the bucket name. For each object we print the key name, the object size and the last modification date, simple as that. In the previous post I uploaded a file called logfile.txt to the bucket called “a-second-bucket-test”. Accordingly, calling this method from Main…

static void Main(string[] args)
{
	S3DemoService demoService = new S3DemoService();
	demoService.RunObjectListingDemo();

	Console.WriteLine("Main done...");
	Console.ReadKey();
}

…yields the following output:

Found object with key logfile.txt, size 4490, last modification date 2014-12-06 13:25:45.

The ListObjectsRequest object provides some basic search functionality. The “Prefix” property will limit the search results to those objects whose key names start with that prefix, e.g.:

listObjectsRequest.Prefix = "log";

That will find all objects whose key names start with “log”, i.e. logfile.txt is still listed.

You can list a limited number of elements, say 5:

listObjectsRequest.MaxKeys = 5;

You can also set a marker, meaning that the request will only list the files whose keys come after the marker value alphabetically:

listObjectsRequest.Marker = "leg";

This will find “logfile.txt”. However a marker value of “lug” won’t.

Download a file

Downloading a file from S3 involves reading from a Stream, a standard operation in the world of I/O. The following function will load the stream from logfile.txt, print its metadata and convert the downloaded byte array into a string:

public void RunDownloadFileDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			GetObjectRequest getObjectRequest = new GetObjectRequest();
			getObjectRequest.BucketName = "a-second-bucket-test";
			getObjectRequest.Key = "logfile.txt";
			GetObjectResponse getObjectResponse = s3Client.GetObject(getObjectRequest);
			MetadataCollection metadataCollection = getObjectResponse.Metadata;

			ICollection<string> keys = metadataCollection.Keys;
			foreach (string key in keys)
			{
				Console.WriteLine("Metadata key: {0}, value: {1}", key, metadataCollection[key]);
			}

			using (Stream stream = getObjectResponse.ResponseStream)
			{
				long length = stream.Length;
				byte[] bytes = new byte[length];
				int bytesToRead = (int)length;
				int numBytesRead = 0;
				do
				{
					int chunkSize = 1000;
					if (chunkSize > bytesToRead)
					{
						chunkSize = bytesToRead;
					}
					int n = stream.Read(bytes, numBytesRead, chunkSize);
					numBytesRead += n;
					bytesToRead -= n;
				}
				while (bytesToRead > 0);
				String contents = Encoding.UTF8.GetString(bytes);
				Console.WriteLine(contents);
			}
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Object download has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Run it from Main:

demoService.RunDownloadFileDemo();

In my case there was only one metadata entry:

Metadata key: x-amz-meta-type, value: log

…which is the one I attached to the file in the previous post.

In the above case we know beforehand that we’re reading text so the bytes could be converted into a string. However, this is of course not necessarily the case as you can store any file type on S3. The GetObjectResponse has a method which allows you to save the stream into a file:

getObjectResponse.WriteResponseStreamToFile("full file path");

…which has an overload to append the stream contents to an existing file:

getObjectResponse.WriteResponseStreamToFile("full file path", true);

Deleting a file

Deleting an object from S3 is just as straightforward as uploading it:

public void RunFileDeletionDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			DeleteObjectRequest deleteObjectRequest = new DeleteObjectRequest();
			deleteObjectRequest.BucketName = "a-second-bucket-test";
			deleteObjectRequest.Key = "logfile.txt";
			DeleteObjectResponse deleteObjectResponse = s3Client.DeleteObject(deleteObjectRequest);
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Object deletion has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Calling this from Main…

demoService.RunFileDeletionDemo();

…removes the previously uploaded logfile.txt from the bucket:

File removed from Amazon S3 bucket

In the next post we’ll see how to work with folders in code.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon S3 with the AWS.NET API Part 2: code basics

Introduction

In the previous post we looked at an overview of Amazon S3 and we also tried a couple of simple operations on the GUI. In this post we’ll start coding. If you followed along the previous series on Amazon Kinesis then the setup section will be familiar to you. Otherwise I’ll assume that you are starting this series without having read anything else on this blog.

Note that we’ll be concentrating on showing and explaining the technical code examples related to AWS. We’ll ignore software principles like SOLID and layering so that we can stay focused. It’s your responsibility to organise your code properly. There are numerous posts on this blog that take up topics related to software architecture.

Installing the SDK

The Amazon .NET SDK is available through NuGet. Open Visual Studio 2012/2013 and create a new C# console application called AmazonS3Demo. The purpose of this application will be to demonstrate the different parts of the SDK around S3. In reality the S3 handler could be any type of application:

  • A website
  • A Windows/Android/iOS app
  • A Windows service
  • etc.

…i.e. any application that’s capable of sending HTTP/S requests to a web service endpoint. We’ll keep it simple and not waste time with view-related tasks.

Install the following NuGet package:

AWS SDK NuGet package

Preparations

We cannot just call the services within the AWS SDK without proper authentication. This is an important reference page on handling your credentials in a safe way. We’ll take the recommended approach and create a profile in the SDK Store and reference it from app.config.

This series is not about AWS authentication so we won’t go into temporary credentials, but later on you may be interested in that option too. Since we’re programmers and it takes a single line of code to set up a profile, we’ll go with the programmatic option. Add the following line to Main:

Amazon.Util.ProfileManager.RegisterProfile("demo-aws-profile", "your access key id", "your secret access key");

I suggest you remove this line from the application later on, especially if you want to distribute it. Run the application and it should execute without exceptions. Next, open app.config and add the appSettings section with the following elements:

<appSettings>
        <add key="AWSProfileName" value="demo-aws-profile"/>
</appSettings>

First demo: listing the available buckets

We’ll put all our test code into a separate class. Insert a cs file called S3DemoService. We’ll need a method to build a handle to the service which is of type IAmazonS3:

private IAmazonS3 GetAmazonS3Client()
{
	return Amazon.AWSClientFactory.CreateAmazonS3Client(RegionEndpoint.EUWest1);
}

Note that we didn’t need to provide our credentials here. They will be extracted automatically using the profile name in the config file.

Let’s first find out what buckets we have in S3. This is almost a trivial task:

public void RunBucketListingDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			ListBucketsResponse response = s3Client.ListBuckets();
			List<S3Bucket> buckets = response.Buckets;
			foreach (S3Bucket bucket in buckets)
			{
				Console.WriteLine("Found bucket name {0} created at {1}", bucket.BucketName, bucket.CreationDate);
			}
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Bucket listing has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode );
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

We use the client to list all the available buckets. The method returns a ListBucketsResponse object which will hold the buckets.

You’ll see a lot of these Request and Response objects throughout the AWS SDK. Amazon are fond of wrapping the request parameters and response properties into Request and Response objects adhering to the RequestResponse pattern.

We then list the name and creation date of our buckets. Call this method from Main:

static void Main(string[] args)
{
	S3DemoService demoService = new S3DemoService();
	demoService.RunBucketListingDemo();

	Console.WriteLine("Main done...");
	Console.ReadKey();
}

Example output:

Found bucket name a-first-bucket-test created at 2014-12-03 22:30:59

Second demo: creating a new bucket

Creating a new bucket is equally easy. Add the following method to S3DemoService:

public void RunBucketCreationDemo()
{
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			PutBucketRequest putBucketRequest = new PutBucketRequest();
			String newBucketName = "a-second-bucket-test";
			putBucketRequest.BucketName = newBucketName;
			PutBucketResponse putBucketResponse = s3Client.PutBucket(putBucketRequest);					
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("Bucket creation has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Again, we can see a Request and a corresponding Response object for creating a bucket. We only specify a name, which means that we go with the default values for e.g. permissions; they are usually fine. PutBucketRequest provides some properties to indicate values that deviate from the defaults. E.g. here’s how to give Everyone the permission to view the bucket:

S3Grant grant = new S3Grant();
S3Permission permission = new S3Permission("List");
S3Grantee grantee = new S3Grantee();
grantee.CanonicalUser = "Everyone";
grant.Grantee = grantee;
grant.Permission = permission;
List<S3Grant> grants = new List<S3Grant>() { grant };
putBucketRequest.Grants = grants;

Call RunBucketCreationDemo from Main as follows:

demoService.RunBucketCreationDemo();

Run the application and the bucket should be created:

Second bucket created from code Amazon S3

You cannot have two buckets with the same name – bucket names are globally unique in S3. Run the application again and you should get an exception:

Bucket creation has failed.
Amazon error code: BucketAlreadyOwnedByYou
Exception message: Your previous request to create the named bucket succeeded and you already own it.

Third demo: file creation

Let’s upload a text file to the bucket we’ve just created. To be exact, we’ll upload its contents. Create a text file on your hard drive, such as c:\logfile.txt, and add some text to it. This example shows how to set the contents of the request. We also set an arbitrary metadata key-value pair with key “type” and value “log”. You can set any type of metadata you want:

public void RunFileUploadDemo()
{
	FileInfo filename = new FileInfo( @"c:\logfile.txt");
	string contents = File.ReadAllText(filename.FullName);
	using (IAmazonS3 s3Client = GetAmazonS3Client())
	{
		try
		{
			PutObjectRequest putObjectRequest = new PutObjectRequest();
			putObjectRequest.ContentBody = contents;
			putObjectRequest.BucketName = "a-second-bucket-test";
			putObjectRequest.Metadata.Add("type", "log");
			putObjectRequest.Key = filename.Name;
			PutObjectResponse putObjectResponse = s3Client.PutObject(putObjectRequest);
		}
		catch (AmazonS3Exception e)
		{
			Console.WriteLine("File creation has failed.");
			Console.WriteLine("Amazon error code: {0}",
				string.IsNullOrEmpty(e.ErrorCode) ? "None" : e.ErrorCode);
			Console.WriteLine("Exception message: {0}", e.Message);
		}
	}
}

Call this method from Main:

demoService.RunFileUploadDemo();

Run the application and the file with its custom metadata should be visible in S3:

File uploaded to Amazon S3 with metadata

We’ll continue with more examples in the next post.

View all posts related to Amazon Web Services and Big Data here.

Using Amazon S3 with the AWS.NET API Part 1: introduction

Introduction

Cloud-based blob storage solutions abound, and Amazon Web Services (AWS) is the leader – or one of the leaders – in that area. Amazon S3 (Simple Storage Service) provides a “secure, durable, highly-scalable object storage” solution, as stated on its homepage. You can use S3 to store just about any type of file: images, text files, videos, JAR files, HTML pages etc. All files are stored in a key-value map, i.e. each file has a key and the file itself is the value.

Purpose of S3

S3 is often used to store static components of web pages such as images or videos. S3 can be integrated with other Amazon components such as RedShift and Elastic MapReduce. It can also be used to transfer large amounts of data from one component to another.

However, in this series we’ll be concentrating on a slightly different but very specific usage: saving, deleting and checking for the existence of text based storage files. S3 can function as an important building block in a Big Data analysis system where a data mining application can pull the raw data from an S3 bucket, i.e. a container of files.

In S3 you can organise your files into pseudo-folders. I wrote “pseudo” as they are not real folders like the ones we create on Windows. They are rather used as visual containers so that you can organise your files in a meaningful way instead of putting all of them under the same bucket. Examples:

s3://sales/january-2015/monthly-sales.txt
s3://sales/february-2015/monthly-sales.txt

“sales” is the top bucket, “january-2015” is a folder and then we have the file itself. You are free to create subfolders within each folder, subfolders within those subfolders, and so on.

Keep in mind that S3 is not meant for updates though. Once you’ve uploaded a file to S3 it cannot be edited in place – even for a text file there’s no editor, and there’s no way to append to or modify part of an object. You’ll need to upload a full replacement file instead.

Amazon have also done a great job at providing SDKs for a range of platforms, like .NET, Java, Python etc. Programmatic access to S3’s services is equally available.

As with many other Amazon components, don’t assume that S3 can only be used together with other Amazon components. Any software capable of executing HTTP calls can access S3: Web API, .NET MVC, iOS, Java desktop apps, Windows services, you name it. Mixed architecture is quickly becoming the norm nowadays and S3 can be used both as part of a mixed solution and in an Amazon-only architecture.

Goals of this series

The goals of this series are two-fold:

  • Provide basic UI and programmatic knowledge to anyone looking for a fast, cheap, reliable and scalable cloud-based blob storage solution
  • Tie in with the previous series about another Amazon service called Kinesis which provides a reliable message queue service as entry point to a Big Data data mining system. We’ll take up S3 as an alternative for raw data storage towards the end of the series. The series on Kinesis ended where we stored the raw data on a local text file – we’ll replace it with S3 based storage

I’ll try to keep the two goals as separate as possible so that all readers with different motivations can follow through – those of you who are only interested in S3 as well as those that will see the greater picture.

I’ll assume that you have at least a test account of Amazon Web Services including the necessary access keys: an Amazon Access Key and a Secret Access Key. You’ll need to sign up with Amazon and then sign up for S3 within Amazon. You can create a free account on the S3 home page using the Create Free Account link button:

S3 create free account link button

Amazon has a lot of great documentation online. Should you get stuck you’ll almost always find an answer there. Don’t be afraid to ask in the comments section below.

In this post we’ll take an easy start and go through some of the visual aspects of S3.

S3 GUI

Log onto AWS and locate the S3 service link:

S3 link on Amazon UI

Before we go anywhere it’s important to mention regions in Amazon. A region in Amazon Web Services is the geographical location of the data centre where your selected service will be located. E.g. if you create a cloud-based server with Amazon EC2 in the region called US East then the server will be set up in North Virginia. It doesn’t mean that your home page deployed on that server won’t be reachable anywhere else; it only determines the physical location of the service. So if you’re expecting the bulk of your customers to come from Japan then it’s wise to set up the first web server in the region called Asia Pacific (Tokyo). Also, if you set up a service in e.g. EU (Ireland) on the AWS UI, then log out and log in again, your service may not be visible at first. A good guess is that you need to select the correct region. The region is indicated in the URL, e.g.:

https://console.aws.amazon.com/s3/home?region=eu-west-1#

…where “eu-west-1” stands for EU (Ireland). You can normally select the region in the top right hand corner of the Amazon UI where you’ll see the user-friendly names of the regions.

Regions are important for virtually all the components in AWS. Take a wild guess, which component is an exception. Check out the top right hand corner of the S3 screen:

No regions for Amazon S3

So regions don’t play the same role here as in other components in the AWS product offering. However, the selected region can still be used to “optimize for latency, minimize costs, or address regulatory requirements”, as it says in the Create a Bucket window we’ll soon see.

You’ll see that by default the screen will show all the top buckets:

S3 All buckets screen

Let’s try a couple of things to get our hands dirty. Click on the Create Bucket button. Give the bucket a name, select the nearest region to your location and press Create. We’ll skip logging right now:

Creating first bucket Amazon S3

The bucket is quickly created:

Created first bucket in Amazon S3

On the right hand side of the screen you can set various properties of the bucket such as logging and security:

Set properties of bucket in Amazon S3

We won’t go into them here, otherwise this series would lose its focus. The default values are usually fine for most purposes. If you ever need to modify the settings, especially those that have to do with permissions, consult the AWS documentation for S3.

Click on the name of the bucket and you’ll see that it’s empty:

First bucket is empty in Amazon S3

Click the Upload button and click Add Files:

Upload a file in Amazon S3

Select some file on your hard drive, preferably a text file as they are easier to open. Click “Start Upload” and the file upload progress should appear on the right hand side:

Text file uploaded to bucket in Amazon S3

Click the file name to select it. The Actions drop-down will list several options available for a file such as Open, Download or Delete which are self-explanatory.

Now click the “Create Folder” button and give the folder a name:

Created folder in Amazon S3 bucket

You can click the folder name to open its contents – much like you do it with a double-click on the Windows file system. You can then upload different files in that folder and create subfolders.

This is enough for starters. We’ll start looking into some basic operations in code in the next post.

View all posts related to Amazon Web Services and Big Data here.
