Using Amazon Elastic MapReduce with the AWS.NET API Part 4: Hive basics with Hadoop
February 26, 2015
Introduction
In the previous post we started our first Amazon EMR cluster with Hadoop, Hive and a couple of other tools installed. We even ran our first Hive query language examples to list the available databases and tables.
In this post we’ll continue with the basics of Hive. We’ll create a database and a table, fill it with some raw data and execute a couple of aggregation scripts on the data. We’ll finally store the aggregated values in another database.
Recall from the previous post that you can have internal, i.e. managed, tables whose data is stored by Hive in its warehouse on the Hadoop distributed file system HDFS. If you drop such a table then you also lose the data stored in it. If on the other hand you create an external table, which points to a location that Hive doesn’t manage, you can drop the table without losing the source data. We’ll only work with internal tables in this post. We’ll create at least one external table later on in this series when we take up how EMR can work with Amazon S3 and DynamoDB. That part of the series will tie together what we have learnt so far about Big Data in AWS.
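Just for reference, here’s a sketch of what an external table declaration could look like; the location, /external/urls/ here, is a made-up HDFS folder and we won’t actually run this statement in this post:
-- '/external/urls/' is only an example location, it could equally well be an S3 path
create external table if not exists url_log_external (url string, response_time int) row format delimited fields terminated by '|' location '/external/urls/';
Dropping url_log_external would only remove the table definition from the Hive metastore; the files under /external/urls/ would stay untouched.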
Remember that the Hive language manual is a good friend of yours when searching for the correct functions. This post is by no means a comprehensive list of examples. It’ll only scratch the surface of what’s available so that you can at least get the hang of it.
We’ll also see a couple of examples of interacting with Hadoop using the command line; the official Hadoop documentation provides very comprehensive guides on the available commands.
Also, just a reminder: the Hive language manual has all the details on the commands and their options.
Preparations
Start an EMR cluster like we did in the previous post with one master and two slave nodes. Log onto the master node using PuTTY and type “hive” to open the Hive CLI.
Hive examples
Let’s first create a database:
create database if not exists url_db comment "This is a test Hadoop database with URL data";
The CLI always responds with some output. In this case it shows the time it took to carry out the command.
Time taken: 0.5 seconds
The “comment” argument is optional, it is only metadata for the user/admin that’s not used by Hive itself.
Let’s double-check if our database has been created:
show databases;
Indeed it has:
default
url_db
Let’s drop the database…:
drop database url_db;
…and create it with another version of “create” where we specify some key/value properties that can be interesting for a user:
create database if not exists url_db comment "This is a test Hadoop database with URL data" with dbproperties ('environment'='test','type'='hadoop','purpose'='play');
The comment and properties will show up if we run the “describe” statement:
describe database extended url_db;
Here’s the output:
url_db This is a test Hadoop database with URL data hdfs://10.193.25.100:9000/user/hive/warehouse/url_db.db hadoop USER {environment=test, purpose=play, type=hadoop}
The hdfs URL points to the location of the database in the Hive warehouse.
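As a side note, the key/value properties are not set in stone: they can be added to or updated later with an “alter database” statement. A quick sketch with a made-up key:
-- 'owner_team' is just an example key, any string key/value pair will do
alter database url_db set dbproperties ('owner_team'='test');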
From now on we’ll set the context of our queries to url_db:
use url_db;
We’ll create a very simple table which simulates the collection of URLs and their response times in milliseconds:
create table if not exists url_response_times (url string,response_time int) comment "This table stores the URL response times" row format delimited fields terminated by "|" tblproperties ('fun'='true');
We provide the column names and data types in parentheses after the table name. The create table statement also has comment and key/value table properties options, just like create database.
The “row format delimited” bit is also optional but here we’re preparing for the input format. The “fields terminated by” clause indicates that we’ll read the data from a file where the fields of each record are separated by some delimiter, in this case the pipe character. We’ll prepare the source file in a bit.
Let’s just check if the table has been created:
show tables;
It has:
url_response_times
We can also read some extended and formatted information about the table as follows:
describe formatted url_response_times;
Here’s the output:
# col_name data_type comment
url string
response_time int
# Detailed Table Information
Database: url_db
Owner: hadoop
CreateTime: Tue Jan 27 20:09:07 UTC 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://10.193.25.100:9000/user/hive/warehouse/url_db.db/url_response_times
Table Type: MANAGED_TABLE
Table Parameters:
comment This table stores the URL response times
fun true
transient_lastDdlTime 1422389347
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
The next logical step is to insert some test values into the table. However, we hit a blocking point. At the time of writing this post EMR comes with Hive version 0.13, and the traditional “insert into table values” statement is only available from Hive 0.14, so that’s not an option. What we can do instead is populate our table from a source that already exists, e.g. from another table or from a file.
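For example, if the data already sat in another Hive table, a statement along these lines would work even in Hive 0.13; note that some_existing_table is a made-up name, we don’t actually have such a table here:
-- some_existing_table is hypothetical, we'll load the data from a file instead
insert into table url_response_times select url, response_time from some_existing_table;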
We’ll go for the file option as hinted above. However, we first need to create it.
Let’s exit the Hive CLI:
exit;
You’ll get back to the “normal” Linux command line.
Let’s create a file using Nano:
nano source.txt
Enter the following highly scientific URL and response time records:
http://www.facebook.com|512
http://www.facebook.com|742
http://www.yahoo.com|983
http://www.yahoo.com|1532
http://www.facebook.com|783
http://www.facebook.com|912
http://www.yahoo.com|437
http://www.yahoo.com|1764
http://www.yahoo.com|142
http://www.facebook.com|824
http://www.facebook.com|381
http://www.yahoo.com|798
http://www.yahoo.com|725
http://www.yahoo.com|236
http://www.facebook.com|315
http://www.facebook.com|853
When done, save the file as follows:
Press Ctrl+X
Press ‘y’ when asked if you want to save the file
At the “File Name to Write: source.txt” prompt press Enter
Get the full path of the file:
readlink -f source.txt
It should be /home/hadoop/source.txt.
We need to copy the source file to HDFS.
Let’s first create a folder for our URL observations using the Hadoop “fs” command, where fs stands for “file system”:
hadoop fs -mkdir /urls
Then let’s copy source.txt to the /urls folder on HDFS:
hadoop fs -put /home/hadoop/source.txt /urls/
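If you want to verify that the file has landed on HDFS you can list the contents of the folder:
hadoop fs -ls /urls/
You should see source.txt in the listing.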
That’s it for the source file preparation. Let’s go back to the Hive CLI by typing “hive”.
Let’s set the context again:
use url_db;
We’re now ready to load the records in the source file to our table:
load data inpath '/urls/source.txt' into table url_response_times;
You should get an output similar to the following:
Loading data to table url_db.url_response_times
Table url_db.url_response_times stats: [numFiles=1, numRows=0, totalSize=315, rawDataSize=0]
OK
Time taken: 0.788 seconds
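A side note: “load data inpath” with an HDFS source moves the file into the table’s warehouse directory rather than copying it. You can check this without leaving Hive, as the CLI lets you execute HDFS commands via “dfs”:
dfs -ls /urls/;
The /urls folder should now be empty since source.txt has been moved under the url_response_times directory in the Hive warehouse.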
Let’s double-check if the records have been imported:
select * from url_response_times;
OK
http://www.facebook.com 512
http://www.facebook.com 742
http://www.yahoo.com 983
http://www.yahoo.com 1532
http://www.facebook.com 783
http://www.facebook.com 912
http://www.yahoo.com 437
http://www.yahoo.com 1764
http://www.yahoo.com 142
http://www.facebook.com 824
http://www.facebook.com 381
http://www.yahoo.com 798
http://www.yahoo.com 725
http://www.yahoo.com 236
http://www.facebook.com 315
http://www.facebook.com 853
We’re making progress!
We’ll now run a simple aggregation where we calculate the average response time of the URLs.
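If you only want to eyeball the result, you can of course run the aggregation as an ad-hoc query without storing it anywhere:
select url, avg(response_time) from url_response_times group by url;
We want to keep the result around though, so we’ll store it in a separate table.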
We first create a table that will store the aggregations:
create table if not exists url_aggregations (url string, avg_response_time double);
The next statement will run the aggregation. Note that the “overwrite” option does exactly what it implies, i.e. it overwrites whatever is already present in the target table:
insert overwrite table url_aggregations select url, avg(response_time) from url_response_times group by url;
Hive will start a Java MapReduce job in the background with output similar to the following:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1422424447344_0001, Tracking URL = http://10.102.0.124:9046/proxy/application_1422424447344_0001/
Kill Command = /home/hadoop/bin/hadoop job -kill job_1422424447344_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-01-28 06:01:03,868 Stage-1 map = 0%, reduce = 0%
2015-01-28 06:01:12,145 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.45 sec
2015-01-28 06:01:20,392 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.14 sec
MapReduce Total cumulative CPU time: 3 seconds 140 msec
Ended Job = job_1422424447344_0001
Loading data to table url_db.url_aggregations
rmr: DEPRECATED: Please use ‘rm -r’ instead.
Deleted hdfs://10.102.0.124:9000/user/hive/warehouse/url_db.db/url_aggregations
Table url_db.url_aggregations stats: [numFiles=1, numRows=2, totalSize=46, rawDataSize=44]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.14 sec HDFS Read: 552 HDFS Write: 125 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 140 msec
OK
Time taken: 28.007 seconds
You might think that 28 seconds is a long time for a simple aggregation. Yes, I agree, it is. In our own project we compared EMR with Redshift, which we’ll investigate in the next series, and Redshift proved to be a lot faster. This is of course not to say that you should immediately forget EMR, but it’s definitely worth trying the same tasks in both environments.
You can also play with the JVM and EC2 instance size settings to speed up the calculations in EMR.
Let’s finally check the average response times:
select * from url_aggregations;
They look correct to me:
http://www.facebook.com 665.25
http://www.yahoo.com 827.125
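If you’d like to clean up after yourself, recall from the beginning of the post that these are managed tables, so dropping them removes the underlying data from HDFS as well. One way to remove everything in one go, as a sketch:
drop database url_db cascade;
The “cascade” keyword drops the tables within the database as well; without it Hive refuses to drop a non-empty database.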
We’ll stop here as it’s now time to take a look into the EMR-related stuff in the AWS .NET library. We’ll start with that in the next post.
View all posts related to Amazon Web Services and Big Data here.