An example of using ShellCommandActivity on Amazon Data Pipeline

Introduction

Amazon Data Pipeline helps you automate recurring tasks and data import/export in the AWS environment.

In this post we’ll go through a very specific example of using Data Pipeline: running an arbitrary JAR file on an EC2 instance through a bash script. This may not be something you do every single day, but I really could have used an example like this when I went through the process in a recent project.

The scenario is the following:

  • You are working on a project within the Amazon web services environment
  • You have a compiled JAR file saved on S3
  • The JAR file can carry out ANY activity – anything from printing “Hello world” to the console to a complex application that interacts with databases and/or other Amazon components to perform some composite action
  • You’d like to execute this file automatically with logging and retries

In that case Data Pipeline is an option to consider. It has several so-called activity types, like CopyActivity, HiveActivity or RedshiftCopyActivity. I won’t go into any of these – I’m not sure how to use them and I’d like to concentrate on the solution to the problem outlined above.

Scripts

The activity type to pick in this case is ShellCommandActivity. It allows you to run a Linux bash script on an EC2 instance – or on an Elastic MapReduce cluster, but I didn’t have any use for that in my case. You’ll need at least two elements: the JAR file to be executed and a bash script which downloads the JAR file onto the EC2 instance created by Data Pipeline and then executes it.

So say you have the following compiled Java application in S3:

JAR file in S3

The accompanying bash script is extremely simple, but make sure you create it in an editor that saves Unix-style (LF) line endings: either a Linux-based editor or, if you want to edit the script in Windows, an editor that can be set to Unix line endings. Do not create the script in a plain Windows text editor like Notepad (or Notepad++ with its default settings), which saves Windows-style CRLF line endings. The carriage return characters won’t be handled properly by the Linux EC2 instance trying to run the script, and you may see strange behaviour such as the JAR file being downloaded but then not being found.
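If a script does end up with Windows line endings, you can check and fix it from any Linux shell before uploading it. A minimal sketch, assuming the script is saved as taskrunner.sh (the file name used further below):

file taskrunner.sh                # reports "with CRLF line terminators" if the endings are wrong
sed -i 's/\r$//' taskrunner.sh    # strip the carriage return characters in place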

Create a bash script along the following lines:

#!/bin/bash
# download the JAR from S3 onto the instance, then run it
aws s3 cp s3://bucket-for-blog/SimpleModelJarForDataPipeline.jar /home/ec2-user/SimpleModelJarForDataPipeline.jar
java -jar /home/ec2-user/SimpleModelJarForDataPipeline.jar

The aws s3 cp line calls the AWS CLI to copy a file located on S3 into the /home/ec2-user/ folder on the generated EC2 machine. Data Pipeline will access the new EC2 instance as the default “ec2-user” user, i.e. not as an administrator, which can lead to authorisation problems: ec2-user won’t be able to save the file to just any folder on the EC2 instance, so it’s wise to select the default home directory of that user.

The java -jar line then executes the JAR file in the standard way.

Save the script, upload it to S3 and take note of its URL, such as s3://scripts/taskrunner.sh.
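The upload itself can also be done with the AWS CLI; a one-liner sketch reusing the file and bucket names from this post:

aws s3 cp taskrunner.sh s3://scripts/taskrunner.sh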

Setting up Data Pipeline

Then in the Data Pipeline console you can create a new pipeline as follows:

1. Click the “Create new pipeline” button.

2. In the Create Pipeline window give the pipeline a name, a description, a schedule and an S3 bucket for the logs, then click Create.

3. A new screen will open where you can add activities, data nodes and other pipeline components:

Create pipeline UI

You’ll see a panel on the right hand side of the screen with headers like Activities, DataNodes, Schedules etc.

4. Click the Add activity button. This will add a new activity with a default name like “DefaultActivity1”, and the Activities section will open automatically.

5. Give the activity a name and select ShellCommandActivity as the type. The Schedule drop-down should already be populated with a name based on the type of schedule you created in the Create Pipeline window.

6. In the Add an optional field… drop-down select Script Uri and enter the S3 location of the bash script we created above.

7. In the Add an optional field… drop-down select Runs On. This opens a new drop-down list; select “Create new: Resource”. This creates a new resource for you under the Resources tab, although it is not immediately visible, and it will get the default name “DefaultResource1”.

8. Expand the Schedules tab and modify the schedule if necessary.

9. Expand the Resources tab. Give the resource a name instead of “DefaultResource1”; this will automatically update the resource name in the Runs On field of the activity from step 7.

10. For the type select Ec2Resource. This will populate the Role and Resource Role drop-down lists with DataPipelineDefaultRole and DataPipelineDefaultResourceRole, which means that the EC2 resource will execute the job with the rights defined for DataPipelineDefaultResourceRole. We’ll come back to this a little later. You can leave these values as they are or change them to a different role available among the drop-down values.

11. Add the following optional fields:

That’s it, click Save pipeline. Data Pipeline will probably complain about some validation exceptions. Review them under Errors/Warnings. Example messages:

  • Insufficient permission to describe key pair
  • Insufficient permission to describe image id
  • resourceRole ‘…’ has insufficient permissions to run datapipeline due to…

This last message is followed by a long list of missing permissions. Frankly, I don’t know why these messages appear or how to make them go away, but I simply chose to ignore them and the pipeline still works.

Then click Save pipeline again and you should be good to go. After the pipeline has run, there will be stderr and stdout outputs where you can review any messages and exceptions from the JAR file execution.
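As an aside, the same configuration can also be captured in a pipeline definition file and uploaded with the AWS CLI instead of clicking through the console. The sketch below is only an outline under assumed names and a made-up daily schedule – the S3 paths are the ones used earlier in this post, while the pipeline, object and resource names are invented and everything should be adapted to your own account:

cat > pipeline.json <<'EOF'
{
  "objects": [
    {
      "id": "Default", "name": "Default",
      "scheduleType": "cron", "failureAndRerunMode": "CASCADE",
      "role": "DataPipelineDefaultRole", "resourceRole": "DataPipelineDefaultResourceRole",
      "pipelineLogUri": "s3://bucket-for-blog/logs/",
      "schedule": { "ref": "DailySchedule" }
    },
    { "id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
      "period": "1 days", "startAt": "FIRST_ACTIVATION_DATE_TIME" },
    { "id": "JarRunnerResource", "name": "JarRunnerResource", "type": "Ec2Resource",
      "terminateAfter": "30 Minutes" },
    { "id": "RunJarActivity", "name": "RunJarActivity", "type": "ShellCommandActivity",
      "scriptUri": "s3://scripts/taskrunner.sh", "runsOn": { "ref": "JarRunnerResource" } }
  ]
}
EOF

# create an empty pipeline, upload the definition and activate it
pipeline_id=$(aws datapipeline create-pipeline --name jar-runner --unique-id jar-runner-1 --query pipelineId --output text)
aws datapipeline put-pipeline-definition --pipeline-id "$pipeline_id" --pipeline-definition file://pipeline.json
aws datapipeline activate-pipeline --pipeline-id "$pipeline_id"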

Before we finish, here’s one tip regarding the DataPipelineDefaultResourceRole role. If your JAR file accesses other AWS resources, such as DynamoDB or S3, then it may fail. Review the stderr output after the job has executed; you may see something similar to this:

IAM to be extended

You can see that DataPipelineDefaultResourceRole has no rights to execute the ListClusters action on an Elastic MapReduce cluster. In this case you need to extend the permissions of the role in the IAM console. Click “Roles” in the left-hand panel, select DataPipelineDefaultResourceRole and then click “Manage Policy”:

Manage role IAM

You’ll see a list of permissions as JSON. In the above case I would extend the JSON with the following:

"elasticmapreduce:ListClusters"

…i.e. exactly as it said in the exception message.

Depending on your exception you may need to add something else, like "dynamodb:Scan" or "cloudwatch:PutMetricData".
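If you prefer the command line over the IAM console, the same effect can be achieved by attaching an additional inline policy to the role. A sketch with an invented policy name and a deliberately broad resource – narrow it down as needed:

# grant the pipeline's resource role the action named in the exception message
aws iam put-role-policy --role-name DataPipelineDefaultResourceRole \
  --policy-name ExtraPipelinePermissions \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [ {
      "Effect": "Allow",
      "Action": [ "elasticmapreduce:ListClusters" ],
      "Resource": "*"
    } ]
  }'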

View all posts related to Amazon Web Services here.

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

13 Responses to An example of using ShellCommandActivity on Amazon Data Pipeline

  1. Naga says:

    Can we leverage existing ec2 instance for running shell script ?

  2. andavarrajanr says:

    do you have any source code for the above described task using aws api.

  3. andavarrajanr says:

    is there any link to learn to automate the above task programatically using AWS Datapipeline APIs

  4. andavarrajanr says:

    Hi, im getting following errors after following the above steps.
    They are
    Key pair does not exist.
    image id does not exist.

    But i have both of them and able to connect the instance created by the image and key using putty.

  5. andavarrajanr says:

    I have fixed the above problem. It is because of region mismatch.

    I have doubts in preparing bash file. Is there any lines required along with the 2 lines,

    like #!/bin/bash

  6. moh says:

    outstanding , this blog really helped. I have a one question though : I have a standalone non-emr cluster that i need to execute shell script on through a driver machine connected to the cluster. do i need a task runner to be installed on the spark cluster ?

  7. Saida says:

    Hi,

    I was not able to copy the file/jar from S3 to EC2 instance. It is giving an error as below:
    amazonaws.datapipeline.taskrunner.TaskExecutionException: usage: aws s3 operation
    aws s3: error: argument operation: invalid choice: ‘cp’ (choose from u’list-object-versions’, u’put-bucket-policy’, u’list-objects’, u’put-bucket-website’, u’put-bucket-notification’, u’put-bucket-logging’, u’upload-part’, u’put-object’, u’delete-bucket-cors’, u’put-bucket-versioning’, u’get-bucket-cors’, u’put-bucket-lifecycle’, u’get-bucket-acl’, u’get-bucket-logging’, u’head-bucket’, u’put-bucket-acl’, u’delete-bucket-website’, u’delete-bucket-policy’, u’delete-objects’, u’get-object’, u’copy-object’, u’list-buckets’, u’put-bucket-request-payment’, u’head-object’, u’delete-bucket-tagging’, u’get-object-torrent’, u’get-bucket-lifecycle’, u’create-bucket’, u’complete-multipart-upload’, u’get-bucket-website’, u’create-multipart-upload’, u’delete-bucket’, u’get-bucket-policy’, u’get-bucket-versioning’, u’list-multipart-uploads’, u’get-bucket-request-payment’, u’put-bucket-tagging’, u’get-bucket-tagging’, u’abort-multipart-upload’, u’put-object-acl’, u’get-bucket-location’, u’put-bucket-cors’, u’delete-bucket-lifecycle’, u’get-bucket-notification’, u’list-parts’, u’get-object-acl’, u’upload-part-copy’, u’delete-object’, u’restore-object’, ‘help’)

    • ced says:

      I had a similar problem. needed to use a newer ami with the newest aws cli. the default ami is a 2013 aws Linux and runs some old libs
