Big Data: a summary of various Amazon Big Data tools

Introduction

We have gone through a lot of material about Big Data on this blog. This post summarises the Amazon Cloud components one by one: what each of them does and what role it plays in a Big Data architecture.

The components

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 10: RedShift in Big Data

Introduction

In the previous post we discussed how to calculate the more complex parts of the aggregation script: the median and nth percentile of the URL response time.

This post will take up the Big Data thread where we left off at the end of the series on Amazon S3. We’ll also refer to what we built at the end of the series on Elastic MapReduce, which showed how to run an aggregation job via the AWS .NET SDK on an available EMR cluster. Familiarity with those topics is therefore a prerequisite for following the code examples in this post.

In this post our goal is to show an alternative to EMR. We’ll also see how to import the raw data source from S3 into RedShift.
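As a quick preview of that import step, loading raw data from S3 into a RedShift table is done with the COPY command. A minimal sketch, where the table name, bucket path and credential placeholders are illustrative, not the ones used in the post:

```sql
-- Load comma-separated raw data from S3 into an existing RedShift table.
-- Table name, bucket path and credentials below are placeholders.
COPY url_response_times
FROM 's3://my-log-bucket/raw/response-times.csv'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER ','
IGNOREHEADER 1;
```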

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 9: data warehousing and the star schema 3

Introduction

In the previous post we started formulating a couple of Postgresql statements to fill in the dimension tables and the aggregation values. We saw that it wasn’t particularly difficult to calculate some basic aggregations over combinations of URL and Customer. We ignored the calculation of the median and percentile values and set them to 0. I’ve decided to dedicate a separate post to those functions as they are considerably more complex than min, max and average.

Median in RedShift

The median is itself a percentile value: it is the 50th percentile. So we could use the percentile function for the median as well, but the median has its own dedicated function in RedShift. It’s not a compact function like min(), where you pass in one or more arguments and get back a single value.
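To illustrate the difference, here is a sketch of both approaches, assuming a hypothetical url_response_times table with url and response_time columns:

```sql
-- MEDIAN in RedShift is a window function: it needs an OVER clause
-- rather than the compact aggregate form of min() or max().
SELECT DISTINCT
    url,
    MEDIAN(response_time) OVER (PARTITION BY url) AS median_response_time,
    -- The same value expressed as the 50th percentile:
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY response_time)
        OVER (PARTITION BY url) AS p50_response_time
FROM url_response_times;
```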

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 8: data warehousing and the star schema 2

Introduction

In the previous post we discussed the basics of data warehousing and the different commonly used database schemas associated with it. We also set up a couple of tables: one raw data table which we filled with some raw data records, two dimension tables and a fact table.

In this post we’ll build upon the existing tables and present a couple of useful Postgresql statements in RedShift. Keep in mind that Postgresql in RedShift is very limited compared to the full version so you often need to be resourceful.

Fill in the dimension tables

Recall that we have two dimension tables: DimUrl and DimCustomer. Both are referenced from the fact table by their primary keys. We haven’t added any data to them yet. We’ll do that now.
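A minimal sketch of how such dimension tables can be filled from the raw data table — the table and column names here are illustrative, and it assumes the dimension primary keys are identity columns so only the natural attributes need inserting:

```sql
-- Populate the URL dimension with the distinct URLs from the raw data.
INSERT INTO DimUrl (url)
SELECT DISTINCT url FROM RawData;

-- Populate the customer dimension the same way.
INSERT INTO DimCustomer (customer_name)
SELECT DISTINCT customer_name FROM RawData;
```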

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 7: data warehousing and the star schema

Introduction

In the previous post we dived into Postgresql statement execution on a RedShift cluster using C# and ODBC. We saw how to execute a single statement or many of them at once. We also tested a parameterised query, which can protect us from SQL injection.

In this post we’ll deviate from .NET a little and concentrate on the basics of data warehousing and data mining in RedShift. In particular we’ll learn about a popular schema type often used in conjunction with data mining: the star schema.

Star and snowflake schemas

I went through the basic characteristics of star and snowflake schemas elsewhere on this blog; I’ll copy the relevant parts here.
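As a quick reminder of the shape of a star schema, here is a hedged DDL sketch — the table and column names are illustrative, not necessarily the ones used later in the series:

```sql
-- Two dimension tables, each with a surrogate primary key...
CREATE TABLE DimUrl (
    id   INT IDENTITY(1,1) PRIMARY KEY,
    url  VARCHAR(512)
);

CREATE TABLE DimCustomer (
    id   INT IDENTITY(1,1) PRIMARY KEY,
    name VARCHAR(256)
);

-- ...and a central fact table referencing them by those keys,
-- holding the aggregated measurements.
CREATE TABLE FactAggregations (
    url_id       INT REFERENCES DimUrl (id),
    customer_id  INT REFERENCES DimCustomer (id),
    min_response FLOAT,
    max_response FLOAT,
    avg_response FLOAT
);
```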

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 6: Postgresql to master node using ODBC

Introduction

In the previous post we tested how to connect to the master node in code using the .NET AWS SDK and ODBC. We also executed our first simple Postgresql statement remotely. In this post we’ll continue along those tracks and execute some more Postgresql statements on our master node.

Preparation

We’ll execute most of the scripts we saw in this blog post. Prepare a text file called postgresscript.txt with the following content and save it somewhere on your hard drive:

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 5: connecting to master node using ODBC

Introduction

In the previous post we went through some basic C# code to communicate with Amazon RedShift. We saw how to get a list of clusters, start a new cluster and terminate one using the .NET AWS SDK.

We haven’t yet seen how to execute Postgresql commands on RedShift remotely from code. That is the main goal of this post.

Installing the ODBC driver

In this section we’ll prepare our Windows environment to be able to connect to RedShift using ODBC. At times this can be a frustrating experience so I’ll try to give you as much detail as I can.

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 4: code beginnings

Introduction

In the previous post we looked into how to connect to the Amazon RedShift master node using a tool called WorkBenchJ. We also went through some very basic Postgresql statements and tested an equally basic aggregation script.

In this post we’ll install the .NET SDK and start building some test code.

Note that we’ll be concentrating on showing and explaining the technical code examples related to AWS. We’ll ignore software principles like SOLID and layering so that we can stay focused. It’s your responsibility to organise your code properly. There are numerous posts on this blog that take up topics related to software architecture.

Installing the SDK

Read more of this post

Using Amazon RedShift with the AWS .NET API Part 3: connecting to the master node

Introduction

In the previous post of this series we quickly looked at what a massively parallel processing database is. We also launched our first Amazon RedShift cluster.

In this post we’ll connect to the master node and start issuing Postgresql commands.

If you don’t have a RedShift cluster available at this point, you can follow the steps in the previous post so that you can try the example code.

Connecting to RedShift

Read more of this post
