An example of using ShellCommandActivity on Amazon Data Pipeline

Introduction

Amazon Data Pipeline helps you automate recurring tasks and data import/export in the AWS environment.

In this post we’ll go through a very specific example of using Data Pipeline: running an arbitrary JAR file on an EC2 instance through a bash script. This may not be something you do every single day, but I really could have used an example like this when I went through the process in a recent project.

The scenario is the following:

  • You are working on a project within the Amazon web services environment
  • You have a compiled JAR file saved on S3
  • The JAR file can carry out ANY activity – anything from printing “Hello world” to the console to a complex application that interacts with databases and/or other Amazon components to perform some composite action
  • You’d like to execute this file automatically with logging and retries

In that case Data Pipeline is an option to consider. It has several so-called activity types, like CopyActivity, HiveActivity or RedshiftCopyActivity. I won’t go into any of these – I’m not sure how to use them and I’d like to concentrate on the solution to the problem outlined above.

Scripts

The activity type to pick in this case is ShellCommandActivity. It allows you to run a Linux bash script on an EC2 instance – or on an Elastic MapReduce cluster, but I didn’t have any use for that in my case. You’ll need at least two elements: the JAR file to be executed and a bash script which downloads the JAR file onto the EC2 instance created by Data Pipeline and then executes it.

So say you have the following compiled Java application in S3:

JAR file in S3

The accompanying bash script is extremely simple, but make sure you create it in an editor that saves Unix-style (LF) line endings: either a Linux-based editor or, if you want to edit the script in Windows, an editor that can be set to Unix line endings. Do not create the script in a plain Windows text editor like Notepad (or Notepad++ with its default settings), which saves Windows-style CRLF line endings. The carriage return characters won’t be handled properly by the Linux EC2 instance trying to run the script, and you may see strange behaviour such as the JAR file being downloaded but then not being found.
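If a script does end up with Windows line endings, you can check and fix it from any Linux shell before uploading it. A minimal sketch, assuming the script is saved as taskrunner.sh (the file name used further below):

file taskrunner.sh                # reports "with CRLF line terminators" if the endings are wrong
sed -i 's/\r$//' taskrunner.sh    # strip the carriage return characters in place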

Create a bash script along the following lines:

#!/bin/bash
# download the JAR from S3 onto the instance, then run it
aws s3 cp s3://bucket-for-blog/SimpleModelJarForDataPipeline.jar /home/ec2-user/SimpleModelJarForDataPipeline.jar
java -jar /home/ec2-user/SimpleModelJarForDataPipeline.jar

The aws s3 cp line calls the AWS CLI to copy a file located on S3 into the /home/ec2-user/ folder on the generated EC2 machine. Data Pipeline will access the new EC2 instance as the default “ec2-user” user, i.e. not as an administrator, which can lead to authorisation problems: ec2-user won’t be able to save the file to just any folder on the EC2 instance, so it’s wise to select the default home directory of that user.

The java -jar line then executes the JAR file in the standard way.

Save the script, upload it to S3 and take note of its URL, such as s3://scripts/taskrunner.sh.
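The upload itself can also be done with the AWS CLI; a one-liner sketch reusing the file and bucket names from this post:

aws s3 cp taskrunner.sh s3://scripts/taskrunner.sh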

Setting up Data Pipeline

Then in the Data Pipeline console you can create a new pipeline as follows:

1. Click the “Create new pipeline” button.

2. In the Create Pipeline window give the pipeline a name, a description, a schedule and an S3 bucket for the logs, then click Create.

3. A new screen will open where you can add activities, data nodes and other pipeline components:

Create pipeline UI

You’ll see a panel on the right hand side of the screen with headers like Activities, DataNodes, Schedules etc.

4. Click the Add activity button. This will add a new activity with a default name like “DefaultActivity1”, and the Activities section will open automatically.

5. Give the activity a name and select ShellCommandActivity as the type. The Schedule drop-down should already be populated with a name based on the type of schedule you created in the Create Pipeline window.

6. In the Add an optional field… drop-down select Script Uri and enter the S3 location of the bash script we created above.

7. In the Add an optional field… drop-down select Runs On. This opens a new drop-down list; select “Create new: Resource”. This creates a new resource for you under the Resources tab, although it is not immediately visible, and it will get the default name “DefaultResource1”.

8. Expand the Schedules tab and modify the schedule if necessary.

9. Expand the Resources tab. Give the resource a name instead of “DefaultResource1”; this will automatically update the resource name in the Runs On field of the activity from step 7.

10. For the type select Ec2Resource. This will populate the Role and Resource Role drop-down lists with DataPipelineDefaultRole and DataPipelineDefaultResourceRole, which means that the EC2 resource will execute the job with the rights defined for DataPipelineDefaultResourceRole. We’ll come back to this a little later. You can leave these values as they are or change them to a different role available among the drop-down values.

11. Add the following optional fields:

That’s it, click Save pipeline. Data Pipeline will probably complain about some validation exceptions. Review them under Errors/Warnings. Example messages:

  • Insufficient permission to describe key pair
  • Insufficient permission to describe image id
  • resourceRole ‘…’ has insufficient permissions to run datapipeline due to…

This last message is followed by a long list of missing permissions. Frankly, I don’t know why these messages appear or how to make them go away, but I simply chose to ignore them and the pipeline still works.

Then click Save pipeline again and you should be good to go. After the pipeline has run, there will be stderr and stdout outputs where you can review any messages and exceptions from the JAR file execution.
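As an aside, the same configuration can also be captured in a pipeline definition file and uploaded with the AWS CLI instead of clicking through the console. The sketch below is only an outline under assumed names and a made-up daily schedule – the S3 paths are the ones used earlier in this post, while the pipeline, object and resource names are invented and everything should be adapted to your own account:

cat > pipeline.json <<'EOF'
{
  "objects": [
    {
      "id": "Default", "name": "Default",
      "scheduleType": "cron", "failureAndRerunMode": "CASCADE",
      "role": "DataPipelineDefaultRole", "resourceRole": "DataPipelineDefaultResourceRole",
      "pipelineLogUri": "s3://bucket-for-blog/logs/",
      "schedule": { "ref": "DailySchedule" }
    },
    { "id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
      "period": "1 days", "startAt": "FIRST_ACTIVATION_DATE_TIME" },
    { "id": "JarRunnerResource", "name": "JarRunnerResource", "type": "Ec2Resource",
      "terminateAfter": "30 Minutes" },
    { "id": "RunJarActivity", "name": "RunJarActivity", "type": "ShellCommandActivity",
      "scriptUri": "s3://scripts/taskrunner.sh", "runsOn": { "ref": "JarRunnerResource" } }
  ]
}
EOF

# create an empty pipeline, upload the definition and activate it
pipeline_id=$(aws datapipeline create-pipeline --name jar-runner --unique-id jar-runner-1 --query pipelineId --output text)
aws datapipeline put-pipeline-definition --pipeline-id "$pipeline_id" --pipeline-definition file://pipeline.json
aws datapipeline activate-pipeline --pipeline-id "$pipeline_id"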

Before we finish, here’s one tip regarding the DataPipelineDefaultResourceRole role. If your JAR file accesses other AWS resources, such as DynamoDB or S3, then it may fail. Review the stderr output after the job has executed; you may see something similar to this:

IAM to be extended

You can see that DataPipelineDefaultResourceRole has no rights to execute the ListClusters action on an Elastic MapReduce cluster. In this case you need to extend the permissions of the role in the IAM console. Click “Roles” in the left-hand panel, select DataPipelineDefaultResourceRole and then click “Manage Policy”:

Manage role IAM

You’ll see a list of permissions as JSON. In the above case I would extend the JSON with the following:

"elasticmapreduce:ListClusters"

…i.e. exactly as it said in the exception message.

Depending on your exception you may need to add something else, like "dynamodb:Scan" or "cloudwatch:PutMetricData".
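If you prefer the command line over the IAM console, the same effect can be achieved by attaching an additional inline policy to the role. A sketch with an invented policy name and a deliberately broad resource – narrow it down as needed:

# grant the pipeline's resource role the action named in the exception message
aws iam put-role-policy --role-name DataPipelineDefaultResourceRole \
  --policy-name ExtraPipelinePermissions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [ {
      "Effect": "Allow",
      "Action": [ "elasticmapreduce:ListClusters" ],
      "Resource": "*"
    } ]
  }'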

View all posts related to Amazon Web Services here.

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

13 Responses to An example of using ShellCommandActivity on Amazon Data Pipeline

  1. Naga says:

    Can we leverage existing ec2 instance for running shell script ?

  2. andavarrajanr says:

    do you have any source code for the above described task using aws api.

  3. andavarrajanr says:

    is there any link to learn to automate the above task programatically using AWS Datapipeline APIs

  4. andavarrajanr says:

    Hi, im getting following errors after following the above steps.
    They are
    Key pair does not exist.
    image id does not exist.

    But i have both of them and able to connect the instance created by the image and key using putty.

  5. andavarrajanr says:

    I have fixed the above problem. It is because of region mismatch.

    I have doubts in preparing bash file. Is there any lines required along with the 2 lines,

    like #!/bin/bash

  6. moh says:

    outstanding , this blog really helped. I have a one question though : I have a standalone non-emr cluster that i need to execute shell script on through a driver machine connected to the cluster. do i need a task runner to be installed on the spark cluster ?

  7. Saida says:

    Hi,

    I was not able to copy the file/jar from S3 to EC2 instance. It is giving an error as below:
    amazonaws.datapipeline.taskrunner.TaskExecutionException: usage: aws s3 operation
    aws s3: error: argument operation: invalid choice: ‘cp’ (choose from u’list-object-versions’, u’put-bucket-policy’, u’list-objects’, u’put-bucket-website’, u’put-bucket-notification’, u’put-bucket-logging’, u’upload-part’, u’put-object’, u’delete-bucket-cors’, u’put-bucket-versioning’, u’get-bucket-cors’, u’put-bucket-lifecycle’, u’get-bucket-acl’, u’get-bucket-logging’, u’head-bucket’, u’put-bucket-acl’, u’delete-bucket-website’, u’delete-bucket-policy’, u’delete-objects’, u’get-object’, u’copy-object’, u’list-buckets’, u’put-bucket-request-payment’, u’head-object’, u’delete-bucket-tagging’, u’get-object-torrent’, u’get-bucket-lifecycle’, u’create-bucket’, u’complete-multipart-upload’, u’get-bucket-website’, u’create-multipart-upload’, u’delete-bucket’, u’get-bucket-policy’, u’get-bucket-versioning’, u’list-multipart-uploads’, u’get-bucket-request-payment’, u’put-bucket-tagging’, u’get-bucket-tagging’, u’abort-multipart-upload’, u’put-object-acl’, u’get-bucket-location’, u’put-bucket-cors’, u’delete-bucket-lifecycle’, u’get-bucket-notification’, u’list-parts’, u’get-object-acl’, u’upload-part-copy’, u’delete-object’, u’restore-object’, ‘help’)

    • ced says:

      I had a similar problem. needed to use a newer ami with the newest aws cli. the default ami is a 2013 aws Linux and runs some old libs
