Using Amazon Elastic MapReduce with the AWS.NET API Part 1: introduction
February 16, 2015
Introduction
Big Data is everywhere; it's one of the most (over)used keywords in IT nowadays. Organisations have to process large amounts of information in the form of messages in real time in order to make decisions about the future of the company. Companies can also use these messages as data points of something they monitor constantly: sales, response times, stock prices, etc. Their goal is presumably to process the data and extract information from it that they themselves or their customers find useful and are willing to pay for.
Amazon Web Services has a solution for pretty much anything in IT, so it's no surprise that there are at least two solutions for data mining and analysis scenarios:
- Elastic MapReduce (EMR): a Hadoop-based tool which is the subject of this series
- Amazon Redshift: a columnar data warehousing solution which we'll investigate in the next series
This is the fourth installment of a series dedicated to out-of-the-box components built and powered by Amazon Web Services (AWS) enabling Big Data handling. The components we have looked at so far are the following:
- Message handler: Amazon Kinesis
- Message persistence: Amazon S3
- Aggregation data persistence: Amazon DynamoDb
You can use EMR to build a large cluster of distributed data stores and use it without any other Amazon component. However, the scope of this series is not to build a large cluster with petabytes of data. The primary goals are the following:
- How to use EMR in code with the AWS .NET SDK
- How to log into an EMR instance and try some basic commands there using the Hadoop command line with a language called Hive
- Show examples of how EMR can be used together with Amazon S3 and DynamoDb
- Show how EMR can fit into a larger Amazon-based Big Data architecture
In summary, the goal is to introduce EMR to those who haven't used it so far and are looking for an out-of-the-box, cloud-based Big Data aggregation and data mining tool. You can quickly set up a database cluster using EMR without worrying about servers, maintenance, security, etc., and start evaluating Hadoop without spending much time on the set-up phase. In other words, you can concentrate on solving the real issues instead.
As hinted at above, we'll not only use C# to communicate with an EMR instance. It's not possible to send C# commands to an EMR instance; they can't be interpreted there, of course. We'll also deal with an SQL-like language called Hive.
Here comes a little “warning”: EMR and Hadoop are large and complex topics with a steep learning curve. It's not possible to describe each in detail on a blog; that would fill a large textbook. So be prepared for a lot of new key terms, technologies and research, especially if both EMR and Hadoop are new to you.
Prerequisites
You'll need to have at least a trial account in Amazon Web Services if you want to try the code examples yourself. Elastic MapReduce is not eligible for the AWS free tier at the time of writing this post. However, Amazon brings down its prices on AWS components quite often as their volumes grow larger, and EMR is no exception. This page shows you some pricing details.
You'll also need the necessary AWS access keys: an Amazon Access Key and a Secret Access Key. You'll need to sign up for EMR in order to create a user account; you can do that on the EMR home page. You'll see a large yellow button on the right-hand side of the screen: Sign in or create an AWS account.
By signing up with Amazon and creating a user you’ll also get a pair of security keys: an Amazon Access Key and a Secret Access Key. Keep them somewhere safe.
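Just to show where these keys end up: here's a minimal sketch of constructing an EMR client with them using the AWS .NET SDK. The key values are obviously placeholders, and the region is an assumption; pick whichever one you work in.

```csharp
using Amazon;
using Amazon.ElasticMapReduce;

// Minimal sketch: construct an EMR client directly with the two keys.
// The key strings are placeholders and the region (EU West/Ireland) is just an example.
var emrClient = new AmazonElasticMapReduceClient(
    "my-amazon-access-key",
    "my-secret-access-key",
    RegionEndpoint.EUWest1);
```

We'll reuse this emrClient object in the other sketches in this post.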
Amazon has a lot of great documentation online. Should you get stuck you’ll almost always find an answer there. Don’t be afraid to ask in the comments section below.
The AWS tools we've looked at so far in this Big Data series can easily be used in isolation, i.e. without the need to access other AWS tools: Kinesis for messaging, S3 for blob data storage and DynamoDb as a NoSQL database. EMR, on the other hand, is slightly different. You can use it in isolation as a standalone Hadoop cluster. However, it integrates very smoothly with other Amazon services, and the Hadoop cluster will run on Amazon EC2 instances. Therefore it's highly recommended that you also have access to, and some knowledge of, Amazon EC2, S3 and DynamoDb. When an EMR cluster is created, a number of EC2 machine instances are spun up, depending on the cluster size. Also, EMR can easily import raw data from S3 and/or DynamoDb for further processing.
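To make the EC2 connection concrete, here's a hedged sketch of launching a small cluster with the .NET SDK, reusing the emrClient from the snippet above. The instance count and types in JobFlowInstancesConfig translate directly into the EC2 machines that get spun up. The cluster name, log bucket and key pair are made-up placeholders, and the AMI version is just an example of one available around the time of writing.

```csharp
using Amazon.ElasticMapReduce;
using Amazon.ElasticMapReduce.Model;

// Sketch only: launch a 3-node cluster, i.e. 3 EC2 instances.
var request = new RunJobFlowRequest
{
    Name = "Hadoop evaluation cluster",    // made-up cluster name
    LogUri = "s3://my-emr-logs/",          // hypothetical S3 bucket for the Hadoop logs
    AmiVersion = "3.3.1",                  // example EMR AMI version
    Instances = new JobFlowInstancesConfig
    {
        Ec2KeyName = "my-key-pair",        // placeholder EC2 key pair name
        InstanceCount = 3,                 // 1 master + 2 core EC2 instances
        MasterInstanceType = "m1.medium",
        SlaveInstanceType = "m1.medium",
        KeepJobFlowAliveWhenNoSteps = true // keep the cluster running after the steps finish
    }
};

RunJobFlowResponse response = emrClient.RunJobFlow(request);
string jobFlowId = response.JobFlowId;     // identifies the new cluster, e.g. for adding steps later
```

Don't worry about the details yet; we'll go through cluster creation properly later in the series.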
Elastic MapReduce described
I covered EMR in another series related to Big Data; I'll copy the relevant parts here.
Elastic MapReduce (EMR) is Amazon’s web service with the Hadoop framework installed. What’s MapReduce?
Short summary of MapReduce
MapReduce – or Map/Filter/Reduce – is widely used in data mining and Big Data applications to extract information from a large, potentially unstructured data set. E.g. finding the average age of all employees who have been employed for more than 5 years is a good candidate for this algorithm.
The individual parts of Map/Filter/Reduce, i.e. the Map, the Filter and the Reduce, are steps or operations in a chain to compute something from a collection. Not all three steps are required in all data mining cases. Examples (a short LINQ sketch follows the list):
- Finding the average age of employees who have been working at a company for more than 5 years: you map the age property of each employee to a list of integers but filter out those who have been working for less than 5 years. Then you calculate the average of the elements in the integer list, i.e. reduce the list to a single outcome.
- Finding the IDs of every employee: if the IDs are strings then you can map the ID fields into a list of strings; there's no need for any filtering or reducing.
- Finding the average age of all employees: you map the age of each employee into an integer list and then calculate the average of those integers in the reduce phase; there's no need for filtering.
- Finding all employees over 50 years of age: we filter out the employees who are 50 or younger; there's no need for mapping or reducing the employees collection.
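For .NET developers it may help to see that these steps map almost one-to-one onto LINQ operators: Select is the map, Where is the filter, and Average (or Aggregate) is the reduce. Here's a minimal, self-contained sketch with a made-up Employee class:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee                          // made-up demo type
{
    public string Id { get; set; }
    public int Age { get; set; }
    public int YearsEmployed { get; set; }
}

public class MapFilterReduceDemo
{
    public static void Main()
    {
        var employees = new List<Employee>
        {
            new Employee { Id = "a1", Age = 54, YearsEmployed = 12 },
            new Employee { Id = "b2", Age = 33, YearsEmployed = 3 },
            new Employee { Id = "c3", Age = 47, YearsEmployed = 8 }
        };

        // Filter + map + reduce: average age of employees employed for more than 5 years.
        double averageAgeOfSeniors = employees
            .Where(e => e.YearsEmployed > 5)   // filter
            .Select(e => e.Age)                // map
            .Average();                        // reduce

        // Map only: the IDs of every employee.
        List<string> ids = employees.Select(e => e.Id).ToList();

        // Map + reduce: average age of all employees.
        double averageAge = employees.Select(e => e.Age).Average();

        // Filter only: all employees over 50.
        List<Employee> overFifty = employees.Where(e => e.Age > 50).ToList();

        Console.WriteLine("{0}, {1}, {2}, {3}",
            averageAgeOfSeniors, string.Join(",", ids), averageAge, overFifty.Count);
    }
}
```

The big difference in Hadoop is that the same three logical steps run distributed across many machines, but the mental model is the same.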
MapReduce implementations in reality can become quite complex depending on the query and structure of the source data.
You can read more about MapReduce in general here.
EMR continued
Hadoop is widely used in the distributed computing world, where large amounts of potentially unstructured input files are stored across several servers. The final goal is of course to process these raw data files in some way, which is what we're trying to do in this series. If you're planning to work with Big Data and data mining then you'll definitely come across Hadoop at some point. Hadoop is a large system with a steep learning curve.
Also, there’s a whole ecosystem built on top of Hadoop to simplify writing MapReduce functions and queries. There are even companies out there, like Hortonworks, who specialise in GUI tools around Hadoop so that you don’t have to write plain queries in a command line tool.
Some of the technologies around Hadoop are the following:
- Java MapReduce: you can write MapReduce applications in Java. These are quite low-level operations and probably not too many Big Data developers go with this option in practice, but I might be wrong here
- Hive: an SQL-like language that creates Java MapReduce functions under the hood and runs them. This is a much less complex solution than writing a Java MapReduce function by hand; see the sketch after this list for how a Hive script can be sent to a cluster from .NET
- Pig: Pig – or Pig Latin – is another language/technology to handle large amounts of data in a fast, parallel and efficient way
…and a whole lot more, but Hive and Pig seem to be the most popular ones at the time of writing this post.
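To connect Hive back to the .NET SDK: queries are typically submitted to a cluster as so-called steps, and the SDK ships a StepFactory helper that builds the common ones. Below is a hedged sketch; the job flow id, the S3 script location and the HiveQL it would contain are made-up examples, and emrClient is the client from the earlier snippet.

```csharp
using System.Collections.Generic;
using Amazon.ElasticMapReduce;
using Amazon.ElasticMapReduce.Model;

// Sketch: install Hive on the cluster, then run a HiveQL script stored in S3.
// "j-XXXXXXXXXXXX" and the S3 path are placeholders.
var stepFactory = new StepFactory();

var installHive = new StepConfig
{
    Name = "Install Hive",
    ActionOnFailure = ActionOnFailure.TERMINATE_JOB_FLOW,
    HadoopJarStep = stepFactory.NewInstallHiveStep()
};

var runHiveScript = new StepConfig
{
    Name = "Run Hive script",
    ActionOnFailure = ActionOnFailure.TERMINATE_JOB_FLOW,
    // The script could contain e.g.: SELECT AVG(age) FROM employees WHERE years_employed > 5;
    HadoopJarStep = stepFactory.NewRunHiveScriptStep("s3://my-bucket/my-script.hive")
};

emrClient.AddJobFlowSteps(new AddJobFlowStepsRequest
{
    JobFlowId = "j-XXXXXXXXXXXX",  // the cluster id returned by RunJobFlow
    Steps = new List<StepConfig> { installHive, runHiveScript }
});
```

Again, this is only a taste; we'll look at steps and Hive scripts in detail in the upcoming parts.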
Hadoop is normally installed on Linux machines. Therefore, if you start a new EMR instance in Amazon, you'll be given a Linux EC2 machine with Hadoop and a couple of other tools installed; e.g. Hive and Pig will follow along, although you can opt out of those when you start a new Hadoop instance. It's good to have them installed so that you can test your Hive/Pig scripts on them.
This is enough for starters. In the next post we’ll start looking at the EMR GUI.
View all posts related to Amazon Web Services and Big Data here.