Using Amazon Elastic MapReduce with the AWS .NET API Part 6: Hive with Amazon S3 and DynamoDb
March 5, 2015
Introduction
In the previous post we looked at how to interact with EMR using the AWS .NET SDK. We saw examples of how to get the cluster list, start a new cluster, assign steps to the cluster and finally terminate it.
In this post we’ll return to the Hive CLI to see how EMR can interact with Amazon S3 and DynamoDb. We investigated both components here and here.
In an earlier post of this series we created a source file directly on HDFS, loaded its contents into a table using Hive and then saved the aggregated data in another Hive table. That approach allows you to use EMR on its own, without having to know any other Amazon components.
However, Amazon components are typically well integrated and EMR is no exception. We’ll do something similar in this post, but we’ll put the source file into an S3 bucket and load the aggregated data into a DynamoDb table. We could save the aggregations to S3 again, but loading them into DynamoDb makes the example more versatile.
You’ll obviously need to know the basics of, and have access to, S3 and DynamoDb. If you’re new to these components then you can skim through the posts referenced above to get an overview. You’ll only need to know how to create an S3 bucket, how to upload a file to it and how to create a table in DynamoDb. You won’t need to know anything about the AWS .NET SDK for S3 and DynamoDb for this post.
Preparations in S3
If you followed along the series on S3 then you’ll have a bucket called raw-urls-data. Otherwise create a bucket with this name. Make sure it’s empty: delete any existing files and folders so that they don’t interfere with our tests. Prepare a text file called raw-urls.txt with the following contents:
http://www.facebook.com|512
http://www.facebook.com|742
http://www.yahoo.com|983
http://www.yahoo.com|1532
http://www.facebook.com|783
http://www.facebook.com|912
http://www.yahoo.com|437
http://www.yahoo.com|1764
http://www.yahoo.com|142
http://www.facebook.com|824
http://www.facebook.com|381
http://www.yahoo.com|798
http://www.yahoo.com|725
http://www.yahoo.com|236
http://www.facebook.com|315
http://www.facebook.com|853
This is the same list of URL response time observations that we worked with in the post referenced above. Upload the file to the S3 bucket raw-urls-data.
Preparations in DynamoDb
Create a new table called aggregation-tests in the region of your preference. Preferably pick the same region as you selected for the EMR clusters before, for the best speed. Select “Hash” as the primary key type and “id” as the hash attribute name. Leave its type as String. Click Continue until you reach the read/write throughput screen and specify 5 for both the read and write capacity units. On the Throughput Alarms screen you can choose not to receive any notifications; this is not a vital business-related table. Step through any remaining screens of the wizard and wait for the table to become active.
Interactions
Start the same type of EMR cluster as we did before on the EMR GUI: 1 master and 2 slaves. Optional settings such as tags are up to you; they are not important for the purpose of this post. Log on to the master node with PuTTY using your Amazon security key file.
You may recall that we mentioned external tables earlier in this series. These are tables whose data is not managed and stored in the Hadoop file system. Instead, Hadoop only holds a reference to the table while the underlying source is stored elsewhere. If the table is dropped then the source is kept and you can easily recreate the table reference.
First type ‘hive’ to enter the Hive CLI. Execute the following command to create the external table:
create external table if not exists raw_urls (url string, response_time int) row format delimited fields terminated by '|' location 's3://raw-urls-data/';
The “external” keyword is, not surprisingly, used to create an external table. We provide the column names and types as usual and indicate how the fields are delimited. The “location” keyword points out the location of the source data. Note that we didn’t have to provide the file name – if you try that it will be rejected anyway. The bucket can contain more raw data files besides raw-urls.txt, such as raw-urls-january.txt, raw-urls-february.txt etc., and all of them will be processed; you won’t need to point out each individual file.
Let’s check if the raw data has been processed:
select * from raw_urls;
…gives…:
OK
http://www.facebook.com 512
http://www.facebook.com 742
http://www.yahoo.com 983
http://www.yahoo.com 1532
http://www.facebook.com 783
http://www.facebook.com 912
http://www.yahoo.com 437
http://www.yahoo.com 1764
http://www.yahoo.com 142
http://www.facebook.com 824
http://www.facebook.com 381
http://www.yahoo.com 798
http://www.yahoo.com 725
http://www.yahoo.com 236
http://www.facebook.com 315
http://www.facebook.com 853
Let’s see what Hive can tell us about the table:
describe formatted raw_urls;
This yields an output similar to the following:
# col_name data_type comment
url string
response_time int
# Detailed Table Information
Database: default
Owner: hadoop
CreateTime: Sat Jan 31 19:32:19 UTC 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: s3://raw-urls-data/
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL TRUE
numFiles 0
numRows -1
rawDataSize -1
totalSize 0
transient_lastDdlTime 1422732739
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim |
serialization.format |
Note the table type: EXTERNAL_TABLE.
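By the way, if you ever need the full DDL behind a table, including the “external” keyword and the “location” clause, Hive can reconstruct it for you. A quick optional check, available in Hive 0.10 and later:
show create table raw_urls;
The printed statement also comes in handy if you later need to recreate the table reference.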
OK, let’s create another external table. It will be a reference to our DynamoDb aggregation table:
create external table if not exists aggregations (id string, url string, avg_response_time double) stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' tblproperties("dynamodb.table.name" = "aggregation-tests", "dynamodb.region" = "eu-west-1", "dynamodb.column.mapping" = "id:id,url:url_name,avg_response_time:average_response_time");
The bits are the following:
- The “stored by” clause is followed by an AWS-specific storage handler class that Amazon’s slightly extended version of Hive understands as DynamoDb
- In the table properties we specify a couple of key-value pairs for the DynamoDb table name and region. Modify the region to reflect your selection
- We also indicate the column mapping, with ‘:’ separating the elements, where the first element always refers to the Hive column name and the second to the DynamoDb attribute name. I made up slightly different names for the URL and response time columns so that we can see the effect in DynamoDb later
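As an optional sanity check you can run the same “describe formatted” command on this table as before; the output should include the storage handler class and the table properties we specified above:
describe formatted aggregations;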
Next, let’s create a temporary table that will hold our aggregations locally before we load them into DynamoDb:
create table if not exists temporary_local_aggregation (id string, url string, avg_response_time double);
Nothing new here I suppose.
We’re now ready to run the aggregation and populate our local aggregation table:
insert overwrite table temporary_local_aggregation SELECT reflect("java.util.UUID", "randomUUID"), url, avg(response_time) FROM raw_urls GROUP BY url;
The only new bit here is the somewhat odd-looking reflect command. Keep in mind that Hive runs on top of Java and it’s possible to call Java methods using reflection. Hive has no built-in function to generate random IDs. Java, on the other hand, has the static java.util.UUID.randomUUID() method, which can be invoked through reflect().
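If you want to convince yourself that the reflection call works before running the full aggregation, you can try it on its own. In this quick sanity check the FROM clause merely supplies rows to evaluate the expression against:
select reflect("java.util.UUID", "randomUUID") from raw_urls limit 1;
You should get a single random GUID back.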
Let’s see what the aggregations look like:
select * from temporary_local_aggregation;
Here’s the output – your GUIDs will of course differ:
bd29c2c2-24eb-4632-b105-4e987c6d6b55 http://www.facebook.com 665.25
4765a283-ada8-4897-98df-6df06bf8db92 http://www.yahoo.com 827.125
We can now load the aggregations into DynamoDb through the external table reference:
insert into table aggregations select * from temporary_local_aggregation;
This will start a Hadoop job. Let it run to completion, then check the contents of the target DynamoDb table in the AWS console: you should find the two aggregated rows with the id, url_name and average_response_time attributes.
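If you prefer to stay in the Hive CLI, you can also read the data back through the external table reference. Keep in mind that this is a scan against DynamoDb and consumes read capacity units from the throughput we provisioned earlier:
select * from aggregations;
The two aggregated rows should come back with the random GUIDs, the URLs and the average response times.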
Let’s drop our external tables:
drop table aggregations; drop table raw_urls;
Check back in S3 and DynamoDb: the source data and the aggregations are still intact, only the table references were dropped.
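This also demonstrates the point about external tables made earlier: re-running the same create statement restores the reference and the data is immediately queryable again. For example, for the S3-backed table:
create external table if not exists raw_urls (url string, response_time int) row format delimited fields terminated by '|' location 's3://raw-urls-data/';
select * from raw_urls;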
In the next post we’ll see how to carry out all these steps from the .NET library.
View all posts related to Amazon Web Services and Big Data here.