Using Amazon RedShift with the AWS .NET API Part 8: data warehousing and the star schema 2


In the previous post we discussed the basics of data warehousing and the database schemas commonly associated with it. We also set up a couple of tables: one raw data table, which we filled with some sample records, two dimension tables and a fact table.

In this post we’ll build upon the existing tables and present a couple of useful Postgresql statements in RedShift. Keep in mind that Postgresql in RedShift is very limited compared to the full version so you often need to be resourceful.

Fill in the dimension tables

Recall that we have 2 dimension tables: DimUrl and DimCustomer. Both are referenced from the fact table by their primary keys. We haven’t added any data into them yet. We’ll do that now.

Our raw data records in the url_response_times table are limited to 3 different URLs and 2 customers. So it would in fact be a piece of cake to add them to the dimension tables manually. However, we’d prefer a more dynamic solution: we cannot revisit these tables every time we have a new URL or a new Customer. Ideally any new URL and Customer should be inserted into these tables at the latest before any aggregation occurs. If this step is forgotten then the fact table will not show any data for them as their IDs are missing.

This is actually quite easy to achieve. We just need a script which checks if the raw data includes any URLs or Customers that haven’t been added into DimUrl and DimCustomer before. In other words we need to add the new unique values to those dimension tables.

The biggest challenge is setting the correct ID. Recall that there are no auto-incrementing IDs in RedShift. Also, if you insert two records with the same primary key, RedShift won’t complain. Furthermore, if you reference a non-existent foreign key ID from a table, RedShift won’t complain either. E.g. if FactAggregationUrl points to an URL by ID 43 and there’s no URL with ID 43 in DimUrl, RedShift won’t give you any warning. Primary key IDs in RedShift are not enforced for data consistency; they are only used for query execution planning.

However, there’s usually some way to achieve something similar to what you need even if it’s not exactly the same. In this case we’d ideally want auto-incrementing unique IDs, which is not possible. The next best solution is to have unique IDs even if they are not incremented by 1 with every record. We can base the ID uniqueness on other parts of the record. E.g. all URLs are unique. Also, you’ll probably distinguish your customers by name, not just by some ID, hence we can safely assume that customer names will also be unique. A combination of the following Postgresql functions will do the trick: LOWER, CRC32, STRTOL and MOD.
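To see what that combination of functions computes, here is a minimal Python sketch of the ID generation used by the insert statements below. It assumes Python’s zlib.crc32 matches RedShift’s CRC32 (both implement standard CRC-32); RedShift’s CRC32 returns the checksum as a hex string and STRTOL parses it back into an integer, so that round trip is simply omitted here:

```python
import zlib

def dw_id(value: str) -> int:
    """Mimic MOD(STRTOL(CRC32(LOWER(value)), 16), 2147483647).

    zlib.crc32 already returns an integer, so there is no need for the
    hex-string round trip that CRC32 + STRTOL perform in RedShift.
    """
    checksum = zlib.crc32(value.lower().encode('utf-8'))
    return checksum % 2147483647

# The ID is deterministic, so re-running the import script on the same
# URL or customer name always produces the same dw_id:
print(dw_id('Nice customer') == dw_id('NICE customer'))  # → True (case-insensitive)
print(0 <= dw_id('Nice customer') < 2147483647)          # → True (fits a 32-bit int)
```

Determinism is what makes the “insert only the new unique values” approach below safe: the same URL always maps to the same ID, so the NOT EXISTS check is all the deduplication we need.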

We have now built up enough tension for a good movie so let’s see the statement that will add all new unique records to DimUrl from url_response_times:

insert into DimUrl (dw_id, url)
select distinct
 MOD(STRTOL(CRC32(LOWER(url)),16), 2147483647),
 url
 from url_response_times R
 where not exists (select 1 from DimUrl DU where
    DU.url = R.url);

If you check the contents of DimUrl then you’ll see the following records:

dw_id url inserted_utc
1817319728 2015-02-21 20:05:01.0
1020831030 2015-02-21 20:05:01.0
393838553 2015-02-21 20:05:01.0

The IDs are quite large, but they are deterministic and unique for all practical purposes (a hash collision is theoretically possible but very unlikely with this few values).

Let’s see the equivalent statement for DimCustomer which is almost identical to the above:

insert into DimCustomer (dw_id, name)
select distinct
 MOD(STRTOL(CRC32(LOWER(customer_name)),16), 2147483647),
 customer_name
 from url_response_times R
 where not exists (select 1 from DimCustomer DC where
    DC.name = R.customer_name);

DimCustomer now has 2 records:

dw_id name inserted_utc
888820231 Favourite customer 2015-02-21 20:10:42.0
922476521 Nice customer 2015-02-21 20:10:42.0

The aggregation by URL and Customer

We’ve prepared our raw data for the aggregation phase. In the first version of the aggregation script we’ll ignore the median and percentile values as they are a bit complex in RedShift. We’ll simply set them to 0. We’ll first get the easier stuff out of the way and tackle the harder ones in the next post.

There are 3 URLs and 2 customers so we’re expecting to see 3*2=6 rows in the Fact table, i.e. one for each URL-Customer pair. We’ll first build a select statement without adding the data to the Fact table. We’ll do that when we have the median and percentile calculations ready as well.

Here’s a SELECT statement to calculate min, max and average response time and the url count:

select
	u.dw_id as url_id
	, c.dw_id as customer_id
	, count(u.dw_id) as url_count
	, MIN(response_time) as min_response_time
	, MAX(response_time) as max_response_time
	, AVG(response_time) as avg_response_time
	, 0 as median
	, 0 as perc_95
	, 0 as perc_99
from
	url_response_times d,
	DimUrl u,
	DimCustomer c
where
	d.url = u.url
	and d.customer_name = c.name
group by
	u.dw_id
	, c.dw_id;

This is all standard SQL; there’s nothing Postgresql-specific here. We select the IDs from the dimension tables, use the standard COUNT, MIN, MAX and AVG functions to calculate some basic aggregates and group the results by the IDs so that we get every possible combination.
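To make the grouping logic explicit, here is a minimal Python sketch of the same aggregation. The URL value is a hypothetical placeholder (the real URLs aren’t reproduced here), the response times are taken from the sample data in this post, and the integer division stands in for RedShift’s AVG on integer columns, which yields the truncated integer average seen in the fact rows:

```python
from collections import defaultdict

# (url, customer_name, response_time) raw records.
# 'page-a' is a made-up URL; the response times come from the sample data.
raw = [
    ('page-a', 'Nice customer', 946),
    ('page-a', 'Nice customer', 823),
    ('page-a', 'Nice customer', 1937),
]

# GROUP BY (url, customer_name): collect each pair's response times.
groups = defaultdict(list)
for url, customer, response_time in raw:
    groups[(url, customer)].append(response_time)

# COUNT, MIN, MAX and truncated AVG per group, like the SELECT above.
for (url, customer), times in groups.items():
    print(url, customer,
          len(times),                # url_count
          min(times),                # min_response_time
          max(times),                # max_response_time
          sum(times) // len(times))  # integer average, truncated
# → page-a Nice customer 3 823 1937 1235
```

The Nice customer group yields count 3, min 823, max 1937 and average 1235, matching the corresponding row in the fact table output below.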

The script gives us the following records:

url_id customer_id url_count min max avg med 95th 99th
393838553 922476521 3 823 1937 1235 0 0 0
1817319728 922476521 3 1536 3256 2193 0 0 0
1020831030 922476521 3 967 2643 2052 0 0 0
1020831030 888820231 3 812 1532 1270 0 0 0
393838553 888820231 3 354 745 525 0 0 0
1817319728 888820231 3 735 1764 1104 0 0 0

Let’s check the first row. Url_id 393838553 refers to one of the three URLs and customer_id 922476521 refers to Nice customer. Let’s see what we have for this combination of data in our raw data table:

select * from url_response_times where customer_name = 'Nice customer' and url='';

…which gives

url customer_name response_time
 Nice customer 946
 Nice customer 823
 Nice customer 1937

We can see that min = 823, max = 1937 and that there were 3 url calls for this URL and from this customer. The average response time is 1235.33333 which was rounded to 1235.

You can check the validity of the other rows that the SELECT statement calculated; they should all be correct.

We’ll stop here for now. In the next post we’ll concentrate on the missing aggregations, i.e. median and 95th and 99th percentiles.

View all posts related to Amazon Web Services and Big Data here.


About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.
