Using Amazon Elastic MapReduce with the AWS .NET API Part 6: Hive with Amazon S3 and DynamoDb


In the previous post we looked at how to interact with EMR using the AWS .NET SDK. We saw examples of how to get the cluster list, start a new cluster, assign steps to the cluster and finally terminate it.

In this post we’ll return to the Hive CLI to see how EMR can interact with Amazon S3 and DynamoDb. We investigated both components here and here.

Earlier in this series we created a source file directly on HDFS and loaded its contents into a table using Hive. We then saved the aggregated data in another Hive table. That approach lets you use EMR on its own without having to know any other Amazon components.

However Amazon components are typically well connected and EMR is no exception. We’ll do something similar in this post but we’ll put the source into an S3 bucket and will load the aggregated data into a DynamoDb table. We could load the aggregations into S3 again but we’re aiming to make the example more versatile.

You’ll obviously need access to S3 and DynamoDb and know their basics. If you’re new to these components you can skim through the posts referenced above for an overview. You’ll only need to know how to create an S3 bucket, how to upload a file to it and how to create a table in DynamoDb. You won’t need to know anything about the AWS .NET SDK for S3 and DynamoDb for this post.

Preparations in S3

If you followed along the series on S3 then you’ll have a bucket called raw-urls-data; otherwise create a bucket with that name. Make sure it’s empty, so delete any existing folders in it so that they don’t interfere with our tests. Then prepare a text file called raw-urls.txt in which each line holds a URL and a response-time observation separated by the ‘|’ character. The sample response times are 512, 742, 983, 1532, 783, 912, 437, 1764, 142, 824, 381, 798, 725, 236, 315 and 853, spread across two URLs.

This is the same list of URL response time observations as we worked with in the post referenced above. Upload the file to the S3 bucket raw-urls-data.
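The expected layout of raw-urls.txt can be sketched in a few lines of Python. The URL names below are made-up placeholders; only the url|response_time shape with a ‘|’ delimiter matters:

```python
# Write a sample raw-urls.txt in the url|response_time layout Hive expects.
# The two URL names are hypothetical placeholders, not the original data.
observations = [
    ("www.site-a.com", 512), ("www.site-a.com", 742),
    ("www.site-b.com", 983), ("www.site-b.com", 1532),
]

with open("raw-urls.txt", "w") as f:
    for url, response_time in observations:
        f.write(f"{url}|{response_time}\n")
```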

Preparations in DynamoDb

Create a new table called aggregation-tests in the region of your preference. For best speed you should go with the same region you selected for your EMR clusters before. Select “Hash” for the primary key type and “id” as the hash attribute name; leave its type as string. Click Continue until you reach the read/write throughput screen and specify 5 for both. On the Throughput Alarms screen you can choose not to get any notification; this is not a vital business-related table. Step through any remaining screens of the wizard and let the table start.
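The wizard choices above boil down to a single table definition. As a sketch, here are the same settings expressed as the parameters you would hand to a programmatic create-table call such as boto3’s `dynamodb.create_table(**create_table_params)` (the console steps are all you actually need for this post):

```python
# Parameters equivalent to the console wizard choices above: a string hash
# key called "id" and a provisioned throughput of 5 reads / 5 writes.
create_table_params = {
    "TableName": "aggregation-tests",
    "KeySchema": [{"AttributeName": "id", "KeyType": "HASH"}],
    "AttributeDefinitions": [{"AttributeName": "id", "AttributeType": "S"}],
    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
}
```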


Preparations in EMR

Start the same type of EMR cluster as we did before on the EMR GUI: 1 master and 2 slaves. It’s up to you to define optional arguments like tags; they are not important for the purpose of this post. Log onto the master node with PuTTY using your Amazon security key file.

You may recall that we mentioned external tables in this series before. These are tables that are not managed and stored in the Hadoop file system. Instead, there’s only a reference to the table from Hadoop but the underlying source is stored elsewhere. If the table is deleted then the source is kept and you can easily recreate the table reference.

First type ‘hive’ to enter the Hive CLI. Execute the following command to create the external table:

create external table if not exists raw_urls (url string, response_time int)
row format delimited fields terminated by '|'
location 's3://raw-urls-data/';

The keyword “external” is, unsurprisingly, what creates an external table. We provide the column names and types as usual and indicate how the fields are delimited. The “location” keyword points at the source data. Note that we didn’t have to provide a file name; if you try that, it will be rejected anyway. There can be more raw data files besides raw-urls.txt, e.g. raw-urls-january.txt, raw-urls-february.txt etc., and all of them will be processed; you won’t need to worry about pointing out each individual file.
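The behaviour is easy to picture: Hive unions every file found under the location prefix into one logical table. A rough local simulation of that (file and directory names here are invented):

```python
import os
import tempfile

# Simulate Hive reading every file under the table's location prefix:
# url|response_time lines from all files end up in one combined row set.
d = tempfile.mkdtemp()
files = {
    "raw-urls-january.txt": "www.site-a.com|512\n",
    "raw-urls-february.txt": "www.site-a.com|742\n",
}
for name, content in files.items():
    with open(os.path.join(d, name), "w") as f:
        f.write(content)

rows = []
for name in sorted(os.listdir(d)):
    with open(os.path.join(d, name)) as f:
        for line in f:
            url, response_time = line.strip().split("|")
            rows.append((url, int(response_time)))

print(rows)  # rows from both files contribute to the "table"
```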

Let’s check if the raw data has been processed:

select * from raw_urls;


OK
512
742
983
1532
783
912
437
1764
142
824
381
798
725
236
315
853

Let’s see what Hive can tell us about the table:

describe formatted raw_urls;

This yields an output similar to the following:

# col_name data_type comment

url string
response_time int

# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Sat Jan 31 19:32:19 UTC 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: s3://raw-urls-data/
Table Type: EXTERNAL_TABLE
Table Parameters:
numFiles 0
numRows -1
rawDataSize -1
totalSize 0
transient_lastDdlTime 1422732739

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim |
serialization.format |

Note the table type: EXTERNAL_TABLE.

OK, let’s create another external table. It will be a reference to our DynamoDb aggregation table:

create external table if not exists aggregations (id string, url string, avg_response_time double)
stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
tblproperties(
  "dynamodb.table.name" = "aggregation-tests",
  "dynamodb.region" = "eu-west-1",
  "dynamodb.column.mapping" = "id:id,url:url_name,avg_response_time:average_response_time");

The bits are the following:

  • The “stored by” clause is followed by an AWS-specific storage handler class that Amazon’s slightly extended version of Hive understands as DynamoDb
  • In the table properties we specify a couple of key-value pairs for the DynamoDb table name and the region. Modify the region to reflect your selection
  • We also declare the column mapping, with ‘:’ separating each pair, where the first element always refers to the Hive column name and the second to the DynamoDb attribute name. I made up slightly different names for the URL and response-time columns so that we can see the effect in DynamoDb later
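The mapping string is just a comma-separated list of hive_column:dynamodb_attribute pairs. Parsed out, it reads as follows:

```python
# The value of the "dynamodb.column.mapping" table property from above.
mapping = "id:id,url:url_name,avg_response_time:average_response_time"

# Hive column name -> DynamoDb attribute name
pairs = dict(item.split(":") for item in mapping.split(","))
print(pairs)
```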

Next, let’s create a temporary table that will hold our aggregations locally before loading its data into DynamoDb:

create table if not exists temporary_local_aggregation (id string, url string, avg_response_time double); 

Nothing new here I suppose.

We’re now ready to run the aggregation and populate our local aggregation table:

insert overwrite table temporary_local_aggregation SELECT reflect("java.util.UUID", "randomUUID"), url, avg(response_time) FROM raw_urls GROUP BY url;

The only new bit here is the somewhat odd-looking reflect command. Keep in mind that Hive is a layer on top of Java and it’s possible to call Java functions using reflection. Hive has no built-in method to generate random IDs. Java on the other hand has the static java.util.UUID.randomUUID() method which can be executed using reflection.
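What the insert statement computes can be mimicked in a few lines of Python: group the observations by URL, average the response times, and attach a random UUID to each aggregate row. The URL names are again made-up placeholders:

```python
import uuid
from collections import defaultdict

# Sample observations in the same (url, response_time) shape as raw_urls.
raw = [("www.site-a.com", 512), ("www.site-a.com", 742),
       ("www.site-b.com", 983), ("www.site-b.com", 1532)]

# GROUP BY url, then avg(response_time), then tag each row with a random id:
# the same (id, url, avg_response_time) shape the Hive query produces.
grouped = defaultdict(list)
for url, response_time in raw:
    grouped[url].append(response_time)

aggregations = [(str(uuid.uuid4()), url, sum(v) / len(v))
                for url, v in grouped.items()]
print(aggregations)
```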

Let’s see what the aggregations look like:

select * from temporary_local_aggregation;

Here’s the output – the GUIDs will of course differ from yours:

bd29c2c2-24eb-4632-b105-4e987c6d6b55 665.25
4765a283-ada8-4897-98df-6df06bf8db92 827.125

Now load the local aggregations into the DynamoDb-backed external table:

insert into table aggregations select * from temporary_local_aggregation;

This will start a Hadoop job. Let it run to completion, then let’s check the contents of the target DynamoDb table:

DynamoDb aggregation table populated

Let’s drop our external tables:

drop table aggregations;
drop table raw_urls;

Check back in S3 and DynamoDb. The source data and the aggregations are still intact, only the table references were dropped.

In the next post we’ll see how to carry out all these steps from the .NET library.

View all posts related to Amazon Web Services and Big Data here.


About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.
