An example of using ShellCommandActivity on Amazon Data Pipeline
November 9, 2014
Introduction
Amazon Data Pipeline helps you automate recurring tasks and data import/export in the AWS environment.
In this post we’ll go through a very specific example of using Data Pipeline: running an arbitrary JAR file on an EC2 instance through a bash script. This may not be something you do every single day, but I really could have used an example like this when I went through the process in a recent project.
The scenario is the following:
- You are working on a project within the Amazon web services environment
- You have a compiled JAR file saved on S3
- The JAR file can carry out ANY activity – it can range from printing “Hello world” to the console window to a complex application that interacts with databases and/or other Amazon components to perform some composite action
- You’d like to execute this file automatically with logging and retries
In that case Data Pipeline is an option to consider. It has several so-called activity types, like CopyActivity, HiveActivity or RedshiftCopyActivity. I won’t go into any of these – I’m not sure how to use them and I’d like to concentrate on the solution to the problem outlined above.
Scripts
The activity type to pick in this case is ShellCommandActivity. It allows you to run a Linux bash script on an EC2 instance – or on an Elastic MapReduce cluster, but I had no use for that in my case. You’ll need at least two elements: the JAR file to be executed and a bash script which downloads the JAR file onto the EC2 instance created by Data Pipeline and then executes it.
So say you have a compiled Java application, SimpleModelJarForDataPipeline.jar, uploaded to S3.
The accompanying bash script is extremely simple, but make sure you create it in a Linux-based editor or, if you want to edit the script on Windows, in an editor that can save Unix-style line endings. Do not create the script in a plain Windows text editor like Notepad: the Windows-style line endings won’t be properly recognised by the Linux EC2 instance trying to run the script, and you may see some strange behaviour such as the JAR file being downloaded but then not being found.
Create a bash script with the following two lines:
aws s3 cp s3://bucket-for-blog/SimpleModelJarForDataPipeline.jar /home/ec2-user/SimpleModelJarForDataPipeline.jar
java -jar /home/ec2-user/SimpleModelJarForDataPipeline.jar
The first line calls the AWS CLI to copy the file from S3 into the /home/ec2-user/ folder on the EC2 machine generated by Data Pipeline. Data Pipeline accesses the new EC2 instance as the default “ec2-user” user, not as an administrator, which can lead to authorisation problems: ec2-user won’t be able to save the file to just any folder on the instance, so it’s wise to stick to that user’s default home directory.
The second line then executes the JAR file with standard java -jar.
Save the script, upload it to S3 and take note of its URL, such as s3://scripts/taskrunner.sh
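If you have the AWS CLI configured on your own machine, the upload itself can also be done from the command line; the file and bucket names below simply reuse the example URL above:
aws s3 cp taskrunner.sh s3://scripts/taskrunner.sh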
Setting up Data Pipeline
Then in the Data Pipeline console you can create a new pipeline as follows:
1. Click “Create new pipeline”.
2. In the Create Pipeline window give the pipeline a name, a description, a schedule and an S3 bucket for the logs, then click Create.
3. A new screen will open where you can add activities, data nodes and other components.
You’ll see a panel on the right-hand side of the screen with headers like Activities, DataNodes, Schedules etc.
4. Click the Add activity button. This will add a new activity with some default name like “DefaultActivity1” and the Activities section will open automatically.
5. Give the activity a name and select ShellCommandActivity as the type. The Schedule drop-down should already be populated based on the schedule you defined in the Create Pipeline window.
6. In the Add an optional field… drop-down select Script Uri and enter the S3 location of the bash script we created above.
7. In the Add an optional field… drop-down select Runs On. This opens a new drop-down list; select “Create new: Resource”. This creates a new resource for you under the Resources tab, although it’s not visible at first, and it gets the default name “DefaultResource1”.
8. Expand the Schedules tab and modify the schedule if necessary
9. Expand the Resources tab. Give the resource a name instead of “DefaultResource1”. This will automatically update the resource name in the activity you created in step 7.
10. For the type select Ec2Resource. This will set the Role and Resource Role drop-down lists to DataPipelineDefaultRole and DataPipelineDefaultResourceRole respectively, which means the EC2 resource will execute the job with the rights defined for DataPipelineDefaultResourceRole. We’ll come back to this a little later. You can leave these values as they are or change them to a different role available among the drop-down values.
11. Add the following optional fields:
- Instance type: provide the size of the EC2 instance that should execute the job, such as m1.small. Even a t1.micro is often enough, as the EC2 resource will only be used to execute a JAR and not to host some large website.
- Region: select the region where the EC2 instance will be spawned. It’s best to select the same region where you set up the pipeline.
- Terminate After: a safety measure so that the EC2 instance doesn’t run forever; specify e.g. 30 minutes.
- Key Pair: enter the name of the key pair. You’ll probably have at least one key pair if you’re already using EC2; otherwise you can create one by following the instructions in the AWS documentation. Just enter the name of the key as you saved it, like “DP key” or “Elvis Presley”. Make sure you have access to this key, i.e. you have exported it from Amazon and saved it somewhere safe. It’s a good idea to assign a key pair to the EC2 instance as you may need to log onto it with e.g. PuTTY during job execution to test your bash script.
- Image id: provide the AMI which will be the basis for the EC2 instance. This is quite an important field: if you don’t provide an AMI, Data Pipeline will select a default one. At the time of writing this post the default AMI in us-east-1 was “ami-05355a6c”, but that image doesn’t come with Java or the latest AWS CLI installed, so our bash script will inevitably fail. Hence first create your own AMI which has these tools installed (a minimal setup sketch follows this list).
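What exactly goes into that custom AMI is up to you. As a minimal sketch, assuming you start from a plain Amazon Linux instance, the preparation before creating the image could look like this (exact package names depend on the base image you start from):
# install a JRE so that "java -jar" works
sudo yum install -y java-1.7.0-openjdk
# make sure a recent AWS CLI is available for the "aws s3 cp" call
sudo yum install -y aws-cli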
That’s it, click Save pipeline. DP will probably complain about some validation exceptions. Review them under Errors/Warnings. Example messages:
- Insufficient permission to describe key pair
- Insufficient permission to describe image id
- resourceRole ‘…’ has insufficient permissions to run datapipeline due to…
This last message is followed by a long list of missing role types. Frankly, I don’t know why these messages appear or how to make them go away, but I simply chose to ignore them and the pipeline still worked.
Then click Save pipeline again and you should be good to go. After each run you can review the stderr and stdout outputs for any messages and exceptions produced during the JAR file execution.
Before we finish, here’s one tip regarding the DataPipelineDefaultResourceRole role. If your JAR file accesses other AWS resources, such as DynamoDB or S3, then it may fail. Review the stderr output after the job has executed; you may see an authorisation error stating, for example, that DataPipelineDefaultResourceRole has no rights to execute the ListClusters action on an Elastic MapReduce cluster. In this case you need to extend the permissions of the role in the IAM console: click “Roles” in the left-hand panel, select DataPipelineDefaultResourceRole and then click “Manage Policy”.
You’ll see a list of permissions as JSON. In the above case I would extend the JSON with the following:
“elasticmapreduce:ListClusters”
…i.e. exactly as it said in the exception message.
Depending on your exception you may need to add something else, like “dynamodb:Scan” or “cloudwatch:PutMetricData”.
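To give an idea of where the string goes, the extended part of the policy would look roughly like the sketch below. This is only a sketch: the rest of the default policy document is omitted here, and the dynamodb:Scan and cloudwatch:PutMetricData entries are just the examples mentioned above – only add the actions your job actually needs.
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters",
        "dynamodb:Scan",
        "cloudwatch:PutMetricData"
      ],
      "Resource": [ "*" ]
    }
  ]
}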