Using Amazon RedShift with the AWS .NET API Part 8: data warehousing and the star schema 2

Introduction

In the previous post we discussed the basics of data warehousing and the different commonly used database schemas associated with it. We also set up a few tables: a raw data table which we filled with some raw data records, two dimension tables and a fact table.

In this post we’ll build upon the existing tables and present a couple of useful Postgresql statements in RedShift. Keep in mind that Postgresql in RedShift is very limited compared to the full version so you often need to be resourceful.

Fill in the dimension tables

Recall that we have 2 dimension tables: DimUrl and DimCustomer. Both are referenced from the fact table by their primary keys. We haven’t added any data into them yet. We’ll do that now.

Our raw data records in the url_response_times table are limited to 3 different URLs and 2 customers, so it would in fact be a piece of cake to add them to the dimension tables manually. However, we'd prefer a more dynamic solution: we cannot revisit these tables every time we have a new URL or a new Customer. Ideally any new URL and Customer should be inserted into these tables at the latest just before the aggregation occurs. If this step is forgotten then the fact table will not show any data for them, as their IDs will be missing.

This is actually quite easy to achieve. We just need a script which checks if the raw data includes any URLs or Customers that haven’t been added into DimUrl and DimCustomer before. In other words we need to add the new unique values to those dimension tables.

The biggest challenge is setting the correct ID. Recall that there's no auto-incrementing ID in RedShift. Also, if you enter two records with the same primary key then RedShift won't complain. Furthermore, if you reference a non-existent foreign key ID from a table then RedShift won't complain either. E.g. if FactAggregationUrl points to a URL by ID 43 and there's no URL with ID 43 in DimUrl then RedShift won't give you any warning. Primary and foreign keys in RedShift are not enforced for data consistency; they only serve as hints for query execution planning.
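
A quick illustration with the DimUrl table and two made-up URLs: both inserts below succeed even though they share the same primary key, and RedShift raises no error:

insert into DimUrl (dw_id, url) values (43, 'http://www.mysite.com/page-one');
insert into DimUrl (dw_id, url) values (43, 'http://www.mysite.com/page-two');
-- both rows are stored: the primary key is informational only
-- remove the test rows afterwards: delete from DimUrl where dw_id = 43;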

However, there’s usually some solution to achieve something similar to what you need even if it’s not exactly the same. In this case you might have required auto-incrementing unique IDs which is not possible. The next best solution is to have unique IDs even if they are not incremented by 1 with every record. We can base the ID uniqueness on other parts of record. E.g. all URLs are unique. Also, you’ll probably distinguish your customers by name, not just some ID. Hence we can safely assume that customer names will also be unique. A combination of the following Postgresql methods will do the trick:

We have now built up enough tension for a good movie, so let's see the statement that will add all new unique records to DimUrl from url_response_times:

insert into DimUrl (dw_id, url)
select distinct
  MOD(STRTOL(CRC32(LOWER(url)), 16), 2147483647),
  url
from url_response_times R
where not exists
  (select 1 from DimUrl DU where DU.url = R.url);
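
To verify, you can query the dimension table directly; inserted_utc is the timestamp column defined in the previous post:

select dw_id, url, inserted_utc from DimUrl;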

…which gives the following records:

dw_id        url                             inserted_utc
---------------------------------------------------------------------
1817319728   http://www.mysite.com/history   2015-02-21 20:05:01.0
1020831030   http://www.mysite.com/plans     2015-02-21 20:05:01.0
393838553    http://www.mysite.com/home      2015-02-21 20:05:01.0

The IDs are quite large but they are unique for our data set. Note that a 32-bit checksum can in theory produce collisions; for a modest number of URLs and customers, however, the risk is negligible.

Let’s see the equivalent statement for DimCustomer which is almost identical to the above:

insert into DimCustomer (dw_id, name)
select distinct
  MOD(STRTOL(CRC32(LOWER(customer_name)), 16), 2147483647),
  customer_name
from url_response_times R
where not exists
  (select 1 from DimCustomer DC where DC.name = R.customer_name);

DimCustomer now has 2 records:

dw_id        name                 inserted_utc
-----------------------------------------------------
888820231    Favourite customer   2015-02-21 20:10:42.0
922476521    Nice customer        2015-02-21 20:10:42.0

The aggregation by URL and Customer

We’ve prepared our raw data for the aggregation phase. In the first version of the aggregation script we’ll ignore the median and percentile values as they are a bit complex in RedShift. We’ll simply set them to 0. We’ll first get the easier stuff out of the way and tackle the harder ones in the next post.

There are 3 URLs and 2 customers so we're expecting to see 3*2=6 rows in the fact table, i.e. one for each URL-Customer pair. We'll first build a SELECT statement without adding the data to the fact table. We'll do that when we have the median and percentile calculations ready as well.

Here’s a SELECT statement to calculate min, max and average response time and the url count:

select
  u.dw_id as url_id,
  c.dw_id as customer_id,
  count(u.dw_id) as url_count,
  MIN(response_time) as min_response_time,
  MAX(response_time) as max_response_time,
  AVG(response_time) as avg_response_time,
  0 as median,
  0 as perc_95,
  0 as perc_99
from url_response_times d
join DimUrl u on d.url = u.url
join DimCustomer c on d.customer_name = c.name
group by
  u.dw_id,
  c.dw_id;

This is all standard SQL; there's nothing Postgresql-specific here. We select the IDs from the dimension tables, use the standard COUNT, MIN, MAX and AVG functions to calculate some basic aggregates and group the results by the two IDs so that we get one row per URL-Customer combination found in the raw data.
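
Once the median and percentile calculations are ready, the same query can be wrapped in an insert into the fact table. Here's a rough sketch; the column list of FactAggregationUrl is assumed, so match it to your actual table definition from the previous post:

insert into FactAggregationUrl
  (url_id, customer_id, url_count, min_response_time, max_response_time,
   avg_response_time, median, perc_95, perc_99) -- column names assumed
select
  u.dw_id,
  c.dw_id,
  count(u.dw_id),
  MIN(response_time),
  MAX(response_time),
  AVG(response_time),
  0, 0, 0
from url_response_times d
join DimUrl u on d.url = u.url
join DimCustomer c on d.customer_name = c.name
group by u.dw_id, c.dw_id;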

The SELECT statement gives us the following records:

url_id       customer_id  url_count  min   max   avg   med  95th  99th
-----------------------------------------------------------------------
393838553    922476521    3          823   1937  1235  0    0     0
1817319728   922476521    3          1536  3256  2193  0    0     0
1020831030   922476521    3          967   2643  2052  0    0     0
1020831030   888820231    3          812   1532  1270  0    0     0
393838553    888820231    3          354   745   525   0    0     0
1817319728   888820231    3          735   1764  1104  0    0     0

Let’s check the first row. Url_id 393838553 refers to http://www.mysite.com/home and customer_id 922476521 refers to Nice customer. Let’s see what we have for this combination of data in our raw data table:

select * from url_response_times where customer_name = 'Nice customer' and url = 'http://www.mysite.com/home';

…which gives

url                          customer_name   response_time
------------------------------------------------------------
http://www.mysite.com/home   Nice customer   946
http://www.mysite.com/home   Nice customer   823
http://www.mysite.com/home   Nice customer   1937

We can see that min = 823, max = 1937 and that there were 3 URL calls for this URL from this customer. The average response time is 1235.33, which shows up as 1235 because AVG over an integer column returns an integer in RedShift.
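
You can reproduce the aggregates of that row directly from the raw data:

select count(*), min(response_time), max(response_time), avg(response_time)
from url_response_times
where customer_name = 'Nice customer'
  and url = 'http://www.mysite.com/home';
-- 3, 823, 1937, 1235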

You can check the validity of the other rows that the SELECT statement calculated; they should all be correct.

We’ll stop here for now. In the next post we’ll concentrate on the missing aggregations, i.e. median and 95th and 99th percentiles.

View all posts related to Amazon Web Services and Big Data here.
