Using Amazon RedShift with the AWS .NET API Part 8: data warehousing and the star schema 2
April 9, 2015
Introduction
In the previous post we discussed the basics of data warehousing and the database schemas commonly used with it. We also set up a couple of tables: a raw data table which we filled with some sample records, two dimension tables and a fact table.
In this post we’ll build upon the existing tables and look at a couple of useful Postgresql statements in RedShift. Keep in mind that the Postgresql dialect available in RedShift is very limited compared to the full version, so you often need to be resourceful.
Fill in the dimension tables
Recall that we have 2 dimension tables: DimUrl and DimCustomer. Both are referenced from the fact table by their primary keys. We haven’t added any data into them yet. We’ll do that now.
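As a quick recap, the two dimension tables look something like the following sketch. The exact definitions were given in the previous post; the column types and the inserted_utc default are assumptions here, based on the data we’ll see below:

create table DimUrl
(
    dw_id bigint not null,          -- hash-based ID, see below
    url varchar(512) not null,
    inserted_utc timestamp default getdate()  -- assumed default; see the previous post for the exact DDL
);

create table DimCustomer
(
    dw_id bigint not null,
    name varchar(256) not null,
    inserted_utc timestamp default getdate()
);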
Our raw data records in the url_response_times table are limited to 3 different URLs and 2 customers, so it would in fact be a piece of cake to add them to the dimension tables manually. However, we’d prefer a more dynamic solution: we cannot revisit these tables every time we have a new URL or a new Customer. Ideally any new URL and Customer should be inserted into these tables at the latest just before the aggregation runs. If this step is forgotten then the fact table will not show any data for them, as their IDs will be missing.
This is actually quite easy to achieve. We just need a script that checks whether the raw data includes any URLs or Customers that haven’t been added to DimUrl and DimCustomer before. In other words, we need to add the new unique values to those dimension tables.
The biggest challenge is setting the correct ID. Recall that there’s no reliable auto-incrementing ID in RedShift: IDENTITY columns exist but their values are not guaranteed to be consecutive. Also, if you insert two records with the same primary key, RedShift won’t complain. Furthermore, if a table references a non-existent ID in another table, RedShift won’t complain either. E.g. if FactAggregationUrl points to a URL by ID 43 and there’s no URL with ID 43 in DimUrl, RedShift won’t give you any warning. Primary and foreign keys in RedShift are not enforced for data consistency; they only serve as hints for query execution planning.
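To illustrate, both of the following inserts go through without any error even though the primary key is duplicated. The ID and URLs here are made up for the demonstration:

-- RedShift happily accepts a duplicate primary key
insert into DimUrl (dw_id, url) values (43, 'http://www.mysite.com/a');
insert into DimUrl (dw_id, url) values (43, 'http://www.mysite.com/b');
-- remove the demo rows again
delete from DimUrl where dw_id = 43;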
However, there’s usually a way to achieve something close to what you need, even if it’s not exactly the same. In this case you might have wanted auto-incrementing unique IDs, which is not possible. The next best solution is to have unique IDs even if they are not incremented by 1 with every record. We can base the ID uniqueness on other parts of the record: all URLs are unique, and you’ll probably distinguish your customers by name, not just by some internal ID, so we can safely assume that customer names will also be unique. A combination of the following Postgresql functions will do the trick:

CRC32: computes a cyclic redundancy check of a string and returns the checksum as a hexadecimal string
STRTOL: converts the string representation of a number in the given base, 16 in our case, into a bigint
MOD: returns the remainder of a division; we use it to keep the resulting ID within the 32-bit integer range

We also pass the value through LOWER first so that differences in casing don’t produce different IDs.
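You can try the ID expression on its own to see what it produces. Here’s a quick sketch with one of our URLs hard-coded:

select MOD(STRTOL(CRC32(LOWER('http://www.mysite.com/home')), 16), 2147483647);
-- according to the DimUrl contents below, this comes out as 393838553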
We have now built up enough tension for a good movie so let’s see the statement that will add all new unique records to DimUrl from url_response_times:
insert into DimUrl (dw_id, url)
select distinct MOD(STRTOL(CRC32(LOWER(url)), 16), 2147483647), url
from url_response_times R
where not exists (select 1 from DimUrl DU where DU.url = R.url);
If you check the contents of DimUrl then you’ll see the following records:
dw_id        url                             inserted_utc
-----------------------------------------------------------------
1817319728   http://www.mysite.com/history   2015-02-21 20:05:01.0
1020831030   http://www.mysite.com/plans     2015-02-21 20:05:01.0
393838553    http://www.mysite.com/home      2015-02-21 20:05:01.0
The IDs are quite large but they are unique. Strictly speaking CRC32 can produce collisions, but for a small set of distinct URLs the chance of two of them hashing to the same ID is negligible.
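Since RedShift won’t enforce uniqueness for us, it’s worth checking that no two URLs happened to hash to the same ID. A simple sanity check:

select dw_id, count(*)
from DimUrl
group by dw_id
having count(*) > 1;
-- an empty result means all IDs are unique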
Let’s see the equivalent statement for DimCustomer which is almost identical to the above:
insert into DimCustomer (dw_id, name)
select distinct MOD(STRTOL(CRC32(LOWER(customer_name)), 16), 2147483647), customer_name
from url_response_times R
where not exists (select 1 from DimCustomer DC where DC.name = R.customer_name);
DimCustomer now has 2 records:
dw_id        name                  inserted_utc
-------------------------------------------------------
888820231    Favourite customer    2015-02-21 20:10:42.0
922476521    Nice customer         2015-02-21 20:10:42.0
The aggregation by URL and Customer
We’ve prepared our raw data for the aggregation phase. In the first version of the aggregation script we’ll ignore the median and percentile values as they are a bit complex in RedShift. We’ll simply set them to 0. We’ll first get the easier stuff out of the way and tackle the harder ones in the next post.
There are 3 URLs and 2 customers so we’re expecting to see 3*2=6 rows in the Fact table, i.e. one for each URL-Customer pair. We’ll first build a select statement without actually adding the data to the Fact table. We’ll do that once we have the median and percentile calculations ready as well.
Here’s a SELECT statement to calculate min, max and average response time and the url count:
select u.dw_id as url_id
    , c.dw_id as customer_id
    , count(u.dw_id) as url_count
    , MIN(response_time) as min_response_time
    , MAX(response_time) as max_response_time
    , AVG(response_time) as avg_response_time
    , 0 as median
    , 0 as perc_95
    , 0 as perc_99
from url_response_times d, DimUrl u, DimCustomer c
where d.url = u.url
and d.customer_name = c.name
group by u.dw_id, c.dw_id;
This is all standard SQL; there’s nothing Postgresql-specific here. We select the IDs from the dimension tables, use the standard COUNT, MIN, MAX and AVG functions to calculate some basic aggregates, and group the results by the two IDs so that we get one row per URL-Customer combination present in the raw data.
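If you prefer explicit JOIN syntax then the same statement can be written as follows; the two forms are equivalent:

select u.dw_id as url_id
    , c.dw_id as customer_id
    , count(u.dw_id) as url_count
    , MIN(response_time) as min_response_time
    , MAX(response_time) as max_response_time
    , AVG(response_time) as avg_response_time
    , 0 as median
    , 0 as perc_95
    , 0 as perc_99
from url_response_times d
inner join DimUrl u on u.url = d.url
inner join DimCustomer c on c.name = d.customer_name
group by u.dw_id, c.dw_id;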
The script gives us the following records:
url_id       customer_id   url_count   min    max    avg    med   95th   99th
-------------------------------------------------------------------------------
393838553    922476521     3           823    1937   1235   0     0      0
1817319728   922476521     3           1536   3256   2193   0     0      0
1020831030   922476521     3           967    2643   2052   0     0      0
1020831030   888820231     3           812    1532   1270   0     0      0
393838553    888820231     3           354    745    525    0     0      0
1817319728   888820231     3           735    1764   1104   0     0      0
Let’s check the first row. Url_id 393838553 refers to http://www.mysite.com/home and customer_id 922476521 refers to Nice customer. Let’s see what we have for this combination of data in our raw data table:
select * from url_response_times where customer_name = 'Nice customer' and url = 'http://www.mysite.com/home';
…which gives
url                          customer_name   response_time
------------------------------------------------------------
http://www.mysite.com/home   Nice customer   946
http://www.mysite.com/home   Nice customer   823
http://www.mysite.com/home   Nice customer   1937
We can see that min = 823, max = 1937 and that there were 3 url calls for this URL and from this customer. The average response time is 1235.33333 which was rounded to 1235.
You can check the validity of the other rows that the SELECT statement calculated, they should all be correct.
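You can also let RedShift do the arithmetic for any single URL-customer pair; the result should match the corresponding row of the aggregation query above:

select count(*) as url_count
    , min(response_time) as min_response_time
    , max(response_time) as max_response_time
    , avg(response_time) as avg_response_time
from url_response_times
where customer_name = 'Nice customer'
and url = 'http://www.mysite.com/home';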
We’ll stop here for now. In the next post we’ll concentrate on the missing aggregations, i.e. median and 95th and 99th percentiles.
View all posts related to Amazon Web Services and Big Data here.