Using Amazon RedShift with the AWS .NET API Part 2: MPP definition and first cluster

Introduction

In the previous post we went through Amazon RedShift at an introductory level. In general we can say that it is a highly efficient data storage and data mining tool especially suited for Big Data scenarios. However, it also comes with serious limitations regarding the available Postgresql language features.

In this post we’ll first summarise an important term in conjunction with RedShift: MPP. We’ll then go on and create our first database in the RedShift GUI.

Redshift as an MPP database

MPP stands for “massively parallel processing” and RedShift is an example of an MPP database. Although RedShift is a tabular database like e.g. MS SQL or Oracle, it was especially engineered towards query efficiency and fast calculations over large data sets. We can say that RedShift is a transition between distributed computing engines with loosely structured data sources such as Hadoop and traditional SQL databases with a large set of features.

Like we saw in the series on Elastic MapReduce distributed systems have a master node and one or more slave nodes or data nodes. RedShift follows a similar architecture so we’ll be talking about clusters again, not a single database server. You can read more about the RedShift architecture on its homepage and on Wikipedia here and here. There’s no point in regurgitating all that in this post. In summary, however, we can say the following about MPP systems:

  • The main interface for communication between the user and the MPP system is the master node
  • The master node doesn’t store any data but rather creates a query plan for queries and keeps track of where the data records the stored, i.e. on which data node, usually using a hashtable.
  • Each data node is an independent system with no connection to other nodes. Therefore they don’t share any data between them. The full data set is distributed across the data nodes. You may be able to configure how to distribute the records across the data nodes but the default usually follows a round-robin mode. The first record in a data set to be inserted is saved on Node A, the second on Node B, the third on Node C, the fourth on Node A again, assuming we have three data nodes. The data set will be evenly distributed across the data nodes this way. An additional benefit is that as each data node only holds a portion of the full set the queries will run a lot faster as they only need to go through a subset of data.
  • If the query needs to be executed across more than one data node – e.g. if you’re not just searching by ID – then the query is split into multiple subqueries by the master node and each subquery is executed in parallel on each data node which again hold only a subset of the full data set. The partial results of the subqueries will be joined before being returned to the user. So with complex queries the full data set may be scanned but they will still run a lot faster than a single query which scans the entire data set, like in the case of traditional databases. As the data nodes do not share any infrastructure you don’t need to worry about data contention either.
  • You can add as many data nodes as you wish to the system making it highly scalable. You can linearly scale up your data nodes by adding data node servers to the system that follow the same specifications regarding disk size, RAM and CPU. As a result the data set will be even more thinly distributed across the nodes and the queries will run faster. It will obviously be more expensive as you’ll have to pay for that extra node but it may be worth the investment, that depends on your scenario.
  • Each ingredient in an MPP system is a separate and independent physical element

Enough of the talk, let’s see some action.

A RedShift database

Log onto the AWS console at https://aws.amazon.com/ and select the RedShift menu point:

RedShift in AWS GUI menu

Probably every service you use with AWS has a region that you can select in the top right section of the UI:

DynamoDb region selector

Note that in your case a different region might be pre-selected so don’t get confused by the fact that it says “Singapore” in the above screenshot. Click the down-pointing arrow to view all the available regions. These regions are significant for all services with a couple of exceptions. E.g. S3, which we discussed in this series, is global and has less regional significance. In the case of RedShift when you create a new database cluster then that cluster will be available in the selected region. It doesn’t, however, mean that users cannot access the cluster located in Ireland from Australia. However, it will take Australian users a bit more time to complete the database operations than it does for a user in the UK. Also, we’ll see later that the region must be specified in code when configuring the access to AWS otherwise you may be wondering why your cluster cannot be located. Therefore I recommend you select the region nearest to your physical location. I’ll go with EU Ireland.

If this is the first time you use RedShift then you’ll probably see the below button to create your first cluster:

Launch cluster button in RedShift

Click that button and you’ll come to the first set of options:

Create new RedShift cluster first screen

Enter a name for the cluster identifier like “a-test-cluster”, “urls” for the database name and enter a username and password for the master node. Click Continue. The next screen will allow you to set the size for the cluster. We’ll be happy with the default values for evaluation purposes, i.e. we’ll have a single compute node in the cluster. Click Continue. The next screen will let you set further options like database encryption, networking, security and alarms. Select “no” for the basic CloudWatch alarm, our test usage will definitely not take up any significant resources. Click Continue to review the selections and the expected cost per hour and node. If you’re ready then click Launch Cluster. The cluster will be in creating status at first:

Amazon RedShift cluster being created

…and after a short while it will be “available”:

Amazon RedShift cluster available

Click the arrow to the left of the cluster ID to view the cluster details:

Amazon RedShift cluster details minor edition

You can click on the cluster name to view even more details, modify the cluster and create a backup:

Amazon RedShift cluster details major edition

That’s it, your MPP database cluster is up and running and is ready to accept queries.

We’ll access the master node in the next post and start writing some basic Postgresql commands.

View all posts related to Amazon Web Services and Big Data here.

Advertisements

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

5 Responses to Using Amazon RedShift with the AWS .NET API Part 2: MPP definition and first cluster

  1. Brian Dead Rift Webb says:

    Reblogged this on Brian By Experience.

  2. Min says:

    these posts do me a great favor. Thanks!

  3. Yuri says:

    Hello, thanks for these series. I have a question: can the above cluster creation and other tooling Redshift tasks be accomplished using C#.NET? I know they can all be done using Redshift CLI and also by Java and Python and C++. But I would like to do these using C#.NET from Windows (and even from .NET Core on Linux). Is this possible and supported?? THANKS.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

ultimatemindsettoday

A great WordPress.com site

Elliot Balynn's Blog

A directory of wonderful thoughts

Robin Sedlaczek's Blog

Developer on Microsoft Technologies

HarsH ReaLiTy

My goal with this blog is to offend everyone in the world at least once with my words… so no one has a reason to have a heightened sense of themselves. We are all ignorant, we are all found wanting, we are all bad people sometimes.

Softwarearchitektur in der Praxis

Wissenswertes zu Webentwicklung, Domain-Driven Design und Microservices

the software architecture

thoughts, ideas, diagrams,enterprise code, design pattern , solution designs

Technology Talks

on Microsoft technologies, Web, Android and others

Software Engineering

Web development

Disparate Opinions

Various tidbits

chsakell's Blog

Anything around ASP.NET MVC,WEB API, WCF, Entity Framework & AngularJS

Cyber Matters

Bite-size insight on Cyber Security for the not too technical.

Guru N Guns's

OneSolution To dOTnET.

Johnny Zraiby

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.

%d bloggers like this: