Using Amazon RedShift with the AWS .NET API Part 3: connecting to the master node

Introduction

In the previous post of this series we quickly looked at what a massively parallel processing database is. We also launched our first Amazon RedShift cluster.

In this post we’ll connect to the master node and start issuing PostgreSQL commands.

If you don’t have a RedShift cluster available at this point, you can follow the steps in the previous post so that you can try the example code.

Connecting to RedShift

There’s no freely available GUI made specifically for RedShift, nor a standard one like SQL Server Management Studio for MS SQL. We’ll use a generic DB tool called SQL Workbench/J to connect to the master node. If you have your cluster ready, you can go through the following documentation on how to download and configure SQL Workbench/J to connect to the master node. I used the JDBC connection steps under the “Connect to Your Cluster over a JDBC Connection in SQL Workbench/J” section: here’s the how-to page.

Go through the entire process and connect to the master node using the JDBC connection string, the master node username and password. The JDBC connection string will include the name of the cluster, such as “a-first-test”, and the name of the database, “urls”, that we created. It’s important to note that there’s no easy way to switch the database context in PostgreSQL like you would in MS SQL with “use ‘database_name'”. If you want to switch to another database, you’ll need to reconnect with the database name modified in the connection string.
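
For illustration, a JDBC URL for SQL Workbench/J could look like the example below. The endpoint host here is made up; copy the real one from your cluster’s configuration page in the AWS console. Also note that the URL prefix depends on the driver you registered in SQL Workbench/J: the standard PostgreSQL driver uses “jdbc:postgresql://”, while Amazon’s own RedShift JDBC driver uses “jdbc:redshift://”.

jdbc:postgresql://a-first-test.abc123xyz456.us-east-1.redshift.amazonaws.com:5439/urls

Switching to the dev database would then simply mean replacing the trailing “/urls” with “/dev” and reconnecting.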

If you get stuck you’re welcome to ask in the comments section below. At the end you should have an empty tab open in the SQL Workbench/J GUI, ready to write some SQL commands.

PostgreSQL basics

There’s a load of documentation available on the PostgreSQL docs page for the different versions of the language. However, as mentioned before, RedShift implements only a limited subset of PostgreSQL. Therefore another important documentation source is the RedShift SQL language manual. You’ll find several links to important pages there that describe the differences between standard PostgreSQL and its RedShift implementation.

OK, let’s start. I’ll go through commands similar to the ones in this post from the Elastic MapReduce series. We’ll simulate a very basic scenario where we measure the response times of URLs and want to calculate the average response time per URL.

We’ll first check what databases are available in our cluster:

SELECT datname FROM pg_database WHERE datistemplate = false;

It gave me 3 databases: urls, dev and one called padb_harvest. We created urls in the previous post, dev is created for you automatically, and padb_harvest appears to be an internal system database that RedShift maintains for its own housekeeping, so we don’t need to worry about it.

Keep in mind that “urls” is the current database context, so we don’t need to switch to it; all database-specific commands will be directed at that database.
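
If you ever want to double-check which database your session is pointing at, the standard current_database() function is available in RedShift as well:

select current_database();

It should return “urls” here.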

Let’s create our first table. Note that in RedShift a varchar column declared without a length defaults to varchar(256):

create table if not exists url_response_times (url varchar, response_time int);

Let’s check the list of tables with the following command:

select * from information_schema.tables where table_schema = 'public' order by table_name;

It will show you the table we’ve just created.

Let’s insert a couple of records into the table:

insert into url_response_times(url, response_time) values ('www.facebook.com', 512);
insert into url_response_times(url, response_time) values ('www.facebook.com',742);
insert into url_response_times(url, response_time) values ('www.yahoo.com',983);
insert into url_response_times(url, response_time) values ('www.yahoo.com',1532);
insert into url_response_times(url, response_time) values ('www.facebook.com',783);
insert into url_response_times(url, response_time) values ('www.facebook.com',912);
insert into url_response_times(url, response_time) values ('www.yahoo.com',437);
insert into url_response_times(url, response_time) values ('www.yahoo.com',1764);
insert into url_response_times(url, response_time) values ('www.yahoo.com',142);
insert into url_response_times(url, response_time) values ('www.facebook.com',824);
insert into url_response_times(url, response_time) values ('www.facebook.com',381);
insert into url_response_times(url, response_time) values ('www.yahoo.com',798);
insert into url_response_times(url, response_time) values ('www.yahoo.com',725);
insert into url_response_times(url, response_time) values ('www.yahoo.com',236);
insert into url_response_times(url, response_time) values ('www.facebook.com',315);
insert into url_response_times(url, response_time) values ('www.facebook.com',853);
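
As a side note, RedShift also accepts multi-row VALUES lists, which are more efficient than issuing many single-row inserts since each statement is a separate round trip to the cluster. The two rows below are only a sketch of the syntax; we’ve already inserted our test data above:

insert into url_response_times (url, response_time) values
('www.facebook.com', 512),
('www.yahoo.com', 983);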

Let’s make sure that the records have been entered:

select * from url_response_times;

Indeed, they have:

[Screenshot: URL response times entered in the RedShift urls database]

We’ll create another table that will hold the aggregations:

create table if not exists url_aggregations (url varchar, avg_response_time double precision);

We’ll calculate the response times averages and insert them into the above table as follows:

insert into url_aggregations (url, avg_response_time)
select url, avg(response_time) from url_response_times group by url;

We use the standard “avg” function to calculate the average response times. One caveat: in RedShift, unlike in standard PostgreSQL, avg applied to an integer column returns an integer and truncates the fractional part, so cast response_time to double precision first if you need exact averages.
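
As an aside, the same result could be achieved in a single step with a CREATE TABLE AS statement, which RedShift supports; the table name below is only for illustration:

create table url_aggregations_ctas as
select url, avg(response_time) as avg_response_time
from url_response_times
group by url;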

Let’s check the records in url_aggregations:

select * from url_aggregations;

…and there you have it:

[Screenshot: Basic URL response time averages in the Amazon RedShift aggregation table]

That wasn’t too difficult. You can see that PostgreSQL is not too different from MS SQL. As this is not a course dedicated to PostgreSQL, I won’t go into too much detail on that front.

This is enough for now. We’ll start looking into the RedShift sections of the .NET AWS SDK in the next post.

View all posts related to Amazon Web Services and Big Data here.


