Using Amazon RedShift with the AWS .NET API Part 1: introduction
March 16, 2015 3 Comments
Introduction
Amazon RedShift is Amazon’s data warehousing solution and is especially well-suited for Big Data scenarios where petabytes of data must be stored and analysed. It follows a columnar DBMS architecture and it was designed especially for heavy data mining requests.
This is the fifth – and probably for a while the last – installment of a series dedicated to out-of-the-box components built and powered by Amazon Web Services (AWS) enabling Big Data handling. The components we have looked at so far are the following:
- Message handler Amazon Kinesis
- Message persistence Amazon S3
- Aggregation data persistence Amazon DynamoDb
- The Hadoop-based data mining and analysis engine Elastic MapReduce
You can use RedShift to build a large distributed database cluster and use it without using any other Amazon component. However, the scope of this series is not to build a large cluster with petabytes of data. The primary goals are the following:
- How to use RedShift in code with the AWS .NET SDK
- How to log into a RedShift database and try some basic commands there using Postgresql
- Show examples of how RedShift can be used together with Amazon S3 to load data
- Show how RedShift can fit into a larger Amazon-based Big Data architecture
- Show a basic example of a star-schema which is especially well-suited for data mining aggregations
In summary the goal is to introduce RedShift to those who haven’t used it so far and are looking for an out-of-the-box cloud based Big Data aggregation and data mining tool. You can quickly set up a database cluster using RedShift without worrying about servers, maintenance, security etc. You can very quickly start evaluating the tool and don’t need to spend much time on the set-up phase. In other words, you can concentrate on solving the real issues instead.
As hinted at above we’ll not only use C# to communicate with the RedShift cluster – or its master node to be exact. It’s not possible to send C# commands to a RedShift database, they can’t be interpreted there of course. RedShift is based on PostgreSQL with some RedShift specific additions. PostgreSQL syntax is very similar to other SQL languages you may be familiar with such as MS SQL. Therefore learning the basics of the language doesn’t require a lot of time and effort and you can become productive quite fast. If you haven’t been exposed to Postgresql then you’ll at least learn something new. In fact, the bulk of this series is made up of code other than C# but we need to lay some groundwork first.
Prerequisites
You’ll need to have at least a trial account in Amazon Web Services if you want to try the code examples yourself. RedShift is not eligible for AWS free-tier at the time of writing this post but a free trial mode is available. Also, Amazon is bringing down its prices on AWS components quite often as their volumes grow larger and RedShift is no exception. This page shows you some pricing details.
You’ll also need the necessary AWS access keys: an Amazon Access Key and a Secret Access Key. You’ll need to sign up for RedShift in order to create a user account. You can create an account on the RedShift home page. You’ll see a large yellow button on the right hand side of the screen: Start free trial.
By signing up with Amazon and creating a user you’ll also get a pair of security keys: an Amazon Access Key and a Secret Access Key. Keep them somewhere safe.
Amazon has a lot of great documentation online. Should you get stuck you’ll almost always find an answer there. Don’t be afraid to ask in the comments section below.
I’m aiming to go through RedShift assuming that you haven’t read anything else on this blog, i.e. you haven’t followed along the larger Big Data series. We’ll only return to our small demo of URL response time aggregations towards the end of the series.
RedShift as an independent component
Often, AWS tools can be used in isolation, i.e. without the need to access other AWS tools. We’ve seen examples of those in this series as well: Kinesis for messaging, S3 for blob data storage and DynamoDb as a NoSql database. RedShift is no exception. You can use it in isolation as a standalone distributed columnar database in an otherwise non-AWS architecture. It can be easily accessed using ODBC with any popular language that AWS created an SDK for. Insertions, queries etc. can all be carried out like you would with MS SQL or Oracle.
However, it also integrates very smoothly with other Amazon services. Therefore it’s highly recommended that you also have access to and have some knowledge of at least S3. We’ll use Amazon S3 to efficiently load the raw data into our RedShift tables later on.
Limitations
Having said all those fine words about RedShift, here comes a warning. The version of PostgreSQL employed on RedShift has some serious limitations compared to the full blown PostgreSQL. E.g. stored procedures, triggers, functions, auto-incrementing primary keys, enforced secondary keys etc. are NOT SUPPORTED at this time. RedShift is still optimised for aggregation functions but it’s a good idea to be aware of the limits. This page has links to the lists of all missing or limited features. You’ll need to be prepared to write set-based SQL logic instead of procedural SQL statements like many programmers would do. This page explains very well with examples where the differences lie. It can happen that you’re trying to implement a query that works on full-blown SQL engines and you’ll end up pulling your hair in frustration. However, there’s usually an acceptable solution so that you can still enjoy the benefits of RedShift as a whole.
Also, new Postgresql language features are constantly being added to RedShift so the list of limitations will probably get shorter. You can keep an eye on the “What’s new” page of AWS here where Amazon announce new features.
RedShift vs. Elastic MapReduce
If you’re not familiar with Elastic MapReduce you can skim through the first couple of posts on EMR referenced above in the introduction. RedShift and EMR can be both used for data mining and analysis purposes. You might want to evaluate the data aggregation capabilities of both tools before coming to a final decision.
According to the tests we’ve performed in our project RedShift aggregation jobs run faster. A lot faster. RedShift can also be used as the storage device for the aggregated data. Other platforms like web services or desktop apps can easily pull the data from a RedShift table. You can also store the aggregated data on EMR on the Hadoop file system but those are not as readily available to other external platforms.
Generally our conclusion is that if you have a set of predefined aggregations that must be performed periodically and automatically the RedShift is a better candidate. However, if you’re building a UI where the client can put together completely flexible queries then Hadoop is probably a better option. RedShift is more used for predefined aggregations – these aggregations can be quite granular but still finite. E.g. you might only support the following time based aggregations: 15mins, 30mins, 1hr, 6hrs, 24hrs, in which case RedShift-based aggregation jobs will be ideal. However, if the user can freely select the aggregation period along with other parameters then Hadoop, possibly with a language called Pig, is a better fit.
That’s enough for starters. We’ll look at some generic architecture of distributed database engines in the next post.
View all posts related to Amazon Web Services and Big Data here.
Reblogged this on Brian By Experience.
Hi Andras,
First of all, thank you for your excellent posts.
I am new to Big data processing.
I just learned EMR, following your example, and I am very impressed by it’s performance. But here in this article, you just said that Redshift is even faster than EMR?
I have a question. At work we have for input: 1. a dynamo table (or possibly redshift table) 2. a file in s3. Both inputs are 10 million records. We need to query and join both inputs and generate a text file with say 8 million records (1 line of text per customer) that we’ll put in S3 then sFtp to a third party.
Would you recommend using EMR or Redshift for this requirement?
Hi Francis, when we built our system we had active help from an AWS engineer. He quickly dismissed EMR for our needs and directed us towards RedShift. RedShift is built for very aggregations across data in various dimension tables. I think the answer to your question depends largely on how you can organise your data. Is it relatively regular and can be imported into “normal” database tables then it’s worth reading the posts on RedShift and test for yourself. Our data is very standard with well defined property names and data types. So even if the raw data was stored in S3 files they could be readily loaded into RedShift tables by a simple Postgres command. I don’t know out of the top of my head whether DynamoDb data can be exported into RedShift tables but it should be possible. So if you can easily organise your data into database table columns then I’d strongly suggest that you test RedShift for your aggregation purposes. In our case EMR didn’t even come close to RedShift.
//Andras