Using Amazon Elastic MapReduce with the AWS.NET API Part 2: the cluster startup GUI
February 19, 2015
Introduction
In the previous post we went through a quick introduction of Amazon Elastic MapReduce. In this post we'll first describe Hadoop very briefly. Then we'll log into EMR and walk through the long series of sections in the GUI that you fill in to specify the details of your cluster.
If you are entirely new to Hadoop and Amazon Web Services then you'll see a lot of new tools mentioned on this page, each hiding a complex application with its own learning curve. So get ready for a lot of reading and research. It helps a great deal if there are people in your organisation with prior AWS knowledge when you start out with EMR.
What is Hadoop?
Apache Hadoop is an important player in the world of Big Data. Although we'll be concentrating on EMR with .NET – i.e. this is not a series built strictly around Hadoop – it's useful to know some basic terms. Here is an introduction quoted directly from the Hadoop home page:
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
For the sake of this series the most important term to remember is HDFS. Also, keep in mind that a cluster usually consists of a master node and one or more slave nodes – this is typical of distributed data analysis tools – although for testing purposes you can run a cluster with a single master node only. The computations are distributed across the available nodes. Hadoop clusters are designed to be very fault tolerant and highly scalable. Hadoop is used especially eagerly in the world of the Internet of Things, which you've probably heard about by now.
In the next series we'll look at an alternative to EMR: Amazon Redshift, which also follows a distributed architecture.
The EMR interface
Log onto the AWS console at https://aws.amazon.com/ and select the EMR menu item:
Just about every AWS service lets you select a region in the top right section of the UI:
Note that in your case a different region might be pre-selected, so don't be confused by the fact that it says "Singapore" in the above screenshot. Click the down-pointing arrow to view all the available regions. Regions are significant for almost all services, with a couple of exceptions: S3, which we discussed in this series, is global, so the region matters less there. In the case of EMR, the cluster you create will be located in the selected region. That doesn't mean that users in Australia cannot access a cluster located in Ireland; it just means that their Hadoop-related operations will take a bit longer to complete than those of a user in the UK. We'll also see later that the region must be specified in code when configuring access to AWS, otherwise you may be left wondering why your cluster cannot be found.
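As a small preview of the coding posts to come, this is roughly how the region is specified when constructing the EMR client with the AWS SDK for .NET. The credentials and the region below are placeholders, not values from this series:

```csharp
using Amazon;
using Amazon.ElasticMapReduce;

class EmrClientDemo
{
    static void Main()
    {
        // The region endpoint must match the region the cluster was created in,
        // otherwise the SDK won't find the cluster. The keys below are placeholders.
        var emrClient = new AmazonElasticMapReduceClient(
            "my-access-key-id",
            "my-secret-access-key",
            RegionEndpoint.EUWest1); // e.g. Ireland – use the region selected in the console
    }
}
```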
If this is the first time you use EMR then you'll probably see a button like the one below to create your first cluster:
Click on that button and you’ll come to the Create Cluster page:
Each section has several links with explanations, so it's futile to regurgitate every little detail here – that could fill a whole book. Let's just briefly go through the sections. We'll start our first cluster in the next post. You'll probably be busy learning about a long list of new concepts for a couple of days.
Cluster configuration
The cluster name is self-explanatory, I guess. Termination protection means that the cluster cannot simply be terminated from the UI later on. Keep in mind that each node of the cluster runs on an Amazon EC2 machine with Linux installed. Without the protection those instances could simply be removed from the EC2 GUI – we'll see later how the cluster members show up there.
Logging output can be saved in Amazon S3. Learn more about logging and debugging here.
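These settings have direct counterparts in the .NET API that we'll use later in the series. Here's a minimal sketch of where the cluster name, termination protection and the S3 log folder would go in a cluster request – the names and the bucket are placeholders:

```csharp
using Amazon.ElasticMapReduce.Model;

class ClusterConfigurationSketch
{
    static RunJobFlowRequest Build()
    {
        return new RunJobFlowRequest
        {
            Name = "my-first-emr-cluster",           // the cluster name shown in the GUI
            LogUri = "s3://my-log-bucket/emr-logs/", // S3 folder where EMR saves its log output
            Instances = new JobFlowInstancesConfig
            {
                TerminationProtected = true          // protects the underlying EC2 instances from accidental termination
            }
        };
    }
}
```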
Tags
Tags help you identify your cluster through metadata. They are key-value pairs of strings that have no meaning to EMR itself; they are there for the administrators, e.g. to record a cluster's purpose or the environment it belongs to.
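For completeness, tags can also be attached when the cluster is created from code – the keys and values below are only examples:

```csharp
using System.Collections.Generic;
using Amazon.ElasticMapReduce.Model;

class ClusterTagsSketch
{
    // EMR ignores the meaning of these pairs; they are purely for your own bookkeeping.
    // The list can later be assigned to the Tags property of the cluster request.
    static List<Tag> Build()
    {
        return new List<Tag>
        {
            new Tag { Key = "environment", Value = "test" },
            new Tag { Key = "purpose",     Value = "log-aggregation" }
        };
    }
}
```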
Software configuration
Here you specify which AMI – Amazon Machine Image – version to use and which Hadoop distribution: "An Amazon Machine Image (AMI) provides the information required to launch an instance, which is a virtual server in the cloud. You specify an AMI when you launch an instance, and you can launch as many instances from the AMI as you need. You can also launch instances from as many different AMIs as you need."

In the Applications subsection you can select which applications to install. These are all applications built upon Hadoop – you'll recognise Hive and Pig. Hue is an open-source, web-based graphical user interface for use with Amazon Elastic MapReduce and Apache Hadoop. Hue groups together several different Hadoop ecosystem projects into a configurable interface for your Amazon EMR cluster.
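In code the AMI version is a plain string on the cluster request, and applications such as Hive or Pig are installed through predefined steps. A sketch, assuming the SDK's StepFactory helper and an example AMI version:

```csharp
using System.Collections.Generic;
using Amazon.ElasticMapReduce.Model;

class SoftwareConfigurationSketch
{
    static RunJobFlowRequest Build()
    {
        var stepFactory = new StepFactory();

        return new RunJobFlowRequest
        {
            AmiVersion = "3.3.1", // example only – pick an AMI version supported in your region
            Steps = new List<StepConfig>
            {
                new StepConfig
                {
                    Name = "Install Hive",
                    ActionOnFailure = "TERMINATE_JOB_FLOW", // stop the cluster if the installation fails
                    HadoopJarStep = stepFactory.NewInstallHiveStep()
                }
            }
        };
    }
}
```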
File system configuration
These are settings for HDFS, referred to above, and for EMRFS, Amazon's implementation of HDFS that lets the cluster store data directly in S3.
Hardware configuration
This is where you specify the instance type and the number of instances for your cluster nodes. I won't go through the details of the EC2 instance types here – there are a lot of them. This page gives you a detailed account of what each type is good for.
For extra security you can launch the cluster in a virtual private cloud (VPC).
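These hardware choices map to a handful of properties on the instance configuration in the .NET API. A sketch with placeholder values for the instance type, the node count and the VPC subnet:

```csharp
using Amazon.ElasticMapReduce.Model;

class HardwareConfigurationSketch
{
    static JobFlowInstancesConfig Build()
    {
        return new JobFlowInstancesConfig
        {
            MasterInstanceType = "m3.xlarge", // size of the master node
            SlaveInstanceType = "m3.xlarge",  // size of the slave nodes
            InstanceCount = 3,                // 1 master + 2 slaves
            Ec2SubnetId = "subnet-12345678"   // optional: launch the cluster inside a VPC subnet
        };
    }
}
```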
Security and access
This part is very important. You should have an EC2 key pair ready so that you can log into the master node from your own computer. The link provides the steps to create a key pair and prepare it for SSH through PuTTY. The PuTTY page also lists the PuTTYgen tool, which converts the PEM file into a PPK one. You can go ahead and create such a key pair now, we'll need it later. The links provide step-by-step guides; it takes maybe 10-15 minutes from scratch.
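When we launch the cluster from .NET later on, the key pair is referenced by its name on the same instance configuration shown above – a minimal sketch, assuming a key pair called "my-emr-keypair" already exists in the selected region:

```csharp
using Amazon.ElasticMapReduce.Model;

class KeyPairSketch
{
    static JobFlowInstancesConfig Build()
    {
        return new JobFlowInstancesConfig
        {
            Ec2KeyName = "my-emr-keypair" // placeholder: name of the EC2 key pair to use for SSH access
        };
    }
}
```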
Bootstrap actions
These are custom actions, typically scripts, that you'd like to run on the nodes while the cluster is being launched. Check out the "Add bootstrap action" drop-down list to see what's available.
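From code a bootstrap action is described by a name and the S3 path of a script that runs on each node during startup. A sketch with a placeholder bucket and script:

```csharp
using System.Collections.Generic;
using Amazon.ElasticMapReduce.Model;

class BootstrapActionsSketch
{
    // The resulting list can be assigned to the BootstrapActions property of the cluster request.
    static List<BootstrapActionConfig> Build()
    {
        return new List<BootstrapActionConfig>
        {
            new BootstrapActionConfig
            {
                Name = "Install extra packages",
                ScriptBootstrapAction = new ScriptBootstrapActionConfig
                {
                    Path = "s3://my-bucket/bootstrap/install-packages.sh" // placeholder script stored in S3
                }
            }
        };
    }
}
```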
Steps
The idea is that you can assign jobs to the cluster already at this stage. The "Add step" drop-down list will give you an idea of what job types are available. E.g. a "Hive program" step will run a Hive script on the cluster. The Hive script may perform a number of operations, such as importing a data set from S3, computing aggregations over it and finally exporting the result to another data store. You can then decide whether to terminate the cluster after the job is done or keep it running – you might need it for additional jobs later on.
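As a preview of the later coding posts, here's a hedged sketch of what a Hive script step might look like from .NET, again using the SDK's StepFactory helper. The script location and its arguments are placeholders:

```csharp
using Amazon.ElasticMapReduce.Model;

class HiveStepSketch
{
    static StepConfig Build()
    {
        var stepFactory = new StepFactory();

        return new StepConfig
        {
            Name = "Run Hive script",
            ActionOnFailure = "CONTINUE", // keep the cluster going even if this step fails
            HadoopJarStep = stepFactory.NewRunHiveScriptStep(
                "s3://my-bucket/scripts/aggregations.hql", // placeholder Hive script
                "-d", "INPUT=s3://my-bucket/input",        // example script parameters
                "-d", "OUTPUT=s3://my-bucket/output")
        };
    }
}
```

Whether the cluster stays alive after the steps have finished corresponds to the KeepJobFlowAliveWhenNoSteps flag on the instance configuration.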
That’s enough for now. In the next post we’ll start our first cluster.
View all posts related to Amazon Web Services and Big Data here.