Using Amazon RedShift with the AWS .NET API Part 9: data warehousing and the star schema 3
April 13, 2015
Introduction
In the previous post we started formulating a couple of PostgreSQL statements to fill in the dimension tables and the aggregation values. We saw that it wasn’t particularly difficult to calculate some basic aggregations over combinations of URL and Customer. We ignored the calculation of the median and percentile values and simply set them to 0. I’ve decided to dedicate a post just to those functions, as they are considerably more complex than min, max and average.
Median in RedShift
The median is itself a percentile value: it is the 50th percentile. So we could use the percentile function to calculate the median as well, but the median has its own dedicated function in RedShift. It’s not a compact aggregate like min(), where you pass in an argument and get back a single value.
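To illustrate that equivalence, the two window functions below should return the same figure; this is a minimal sketch against the url_response_times table we’ve been using throughout the series:

-- median() and percentile_cont(0.5) compute the same thing:
-- the 50th percentile over the whole window
select median(response_time) over () as median_value
     , percentile_cont(0.5) within group (order by response_time) over () as p50_value
from url_response_times;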
Let’s try and use the function to calculate the median over all values we currently have in url_response_times:
select median(response_time) over() from url_response_times;
The function returns 1217 for every row in the data set.
We can reduce that to a single value using the distinct keyword:
select distinct median
from (select median(response_time) over () as median from url_response_times) m;
That gives 1217, which is the correct value for the full data set. However, that’s not really what we need: we’d like to see the median response time by URL and Customer. The following extended statement does the trick with the help of the PARTITION BY and GROUP BY clauses:
select distinct median(response_time) over (partition by url, customer_name) as median
     , url
     , customer_name
from url_response_times
group by response_time, url, customer_name;
We now have the correct figures for each URL and Customer combination.
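As a side note, RedShift also offers median as a plain aggregate function, so depending on your cluster version the same kind of figures could be produced with a simpler grouped form; this is a sketch, not the approach used in the rest of this post:

-- alternative: median as an aggregate rather than a window function
select url
     , customer_name
     , median(response_time) as median
from url_response_times
group by url, customer_name;

Note that this computes the median over all raw rows, whereas the statement above first groups by response_time.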
However, how do we add all that to our aggregation script? We need to turn the above statement into a derived table and reference it from the enclosing query. We also need to extend the WHERE and GROUP BY clauses to take this derived table into consideration:
select u.dw_id as url_id
     , c.dw_id as customer_id
     , count(u.dw_id) as url_count
     , MIN(response_time) as min_response_time
     , MAX(response_time) as max_response_time
     , AVG(response_time) as avg_response_time
     , median as med
     , 0 as perc_95
     , 0 as perc_99
from url_response_times d
   , DimUrl u
   , DimCustomer c
   , (select distinct median(response_time) over (partition by url, customer_name) as median
           , url
           , customer_name
      from url_response_times
      group by response_time, url, customer_name) med
where d.url = u.url
  and d.customer_name = c.name
  and d.url = med.url
  and d.customer_name = med.customer_name
group by u.dw_id
       , c.dw_id
       , med.median;
Running this statement now returns the basic aggregations plus the median for each URL and Customer combination, with the two percentile columns still set to 0.
Percentiles in RedShift
We need to go through a similar process with the percentiles, as the RedShift percentile_cont is a window function with a couple of clauses of its own. The following statement returns the 95th percentile for each record in the table:
select perc
from (select response_time
           , percentile_cont(0.95) within group (order by response_time) over () as perc
      from url_response_times) p;
As above, we can reduce that to a single value using the distinct keyword:
select distinct perc
from (select perc
      from (select response_time
                 , percentile_cont(0.95) within group (order by response_time) over () as perc
            from url_response_times) p) t;
…which returns 2734.95. Let’s see what this query looks like when we break the percentiles down by URL and Customer:
select distinct percentile_95
     , url
     , customer_name
from (select response_time
           , percentile_cont(0.95) within group (order by response_time)
             over (partition by url, customer_name) as percentile_95
           , url
           , customer_name
      from url_response_times
      group by response_time, url, customer_name) p;
…which gives the correct aggregations per URL and Customer.
The 99th percentile calculation is the same; we just need to pass 0.99 into the percentile_cont function.
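For completeness, here is the 99th percentile variant of the previous statement; only the argument to percentile_cont changes:

select distinct percentile_99
     , url
     , customer_name
from (select response_time
           , percentile_cont(0.99) within group (order by response_time)
             over (partition by url, customer_name) as percentile_99
           , url
           , customer_name
      from url_response_times
      group by response_time, url, customer_name) p;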
Let’s turn this statement into a derived table and finalise our aggregation script:
select u.dw_id as url_id
     , c.dw_id as customer_id
     , count(u.dw_id) as url_count
     , MIN(response_time) as min_response_time
     , MAX(response_time) as max_response_time
     , AVG(response_time) as avg_response_time
     , median as med
     , percentile_95 as perc_95
     , percentile_99 as perc_99
from url_response_times d
   , DimUrl u
   , DimCustomer c
   , (select distinct median(response_time) over (partition by url, customer_name) as median
           , url
           , customer_name
      from url_response_times
      group by response_time, url, customer_name) med
   , (select distinct percentile_95
           , url
           , customer_name
      from (select response_time
                 , percentile_cont(0.95) within group (order by response_time)
                   over (partition by url, customer_name) as percentile_95
                 , url
                 , customer_name
            from url_response_times
            group by response_time, url, customer_name) p) perc_95
   , (select distinct percentile_99
           , url
           , customer_name
      from (select response_time
                 , percentile_cont(0.99) within group (order by response_time)
                   over (partition by url, customer_name) as percentile_99
                 , url
                 , customer_name
            from url_response_times
            group by response_time, url, customer_name) p) perc_99
where d.url = u.url
  and d.customer_name = c.name
  and d.url = med.url
  and d.customer_name = med.customer_name
  and d.url = perc_95.url
  and d.customer_name = perc_95.customer_name
  and d.url = perc_99.url
  and d.customer_name = perc_99.customer_name
group by u.dw_id
       , c.dw_id
       , med.median
       , perc_95.percentile_95
       , perc_99.percentile_99;
…and we have the correct calculations in place now.
Completing the fact table insertions
The last step now is to add these aggregated values to our fact table. That’s quite trivial:
insert into FactAggregationUrl
( url_id
, customer_id
, url_call_count
, min
, max
, avg
, median
, percentile_95
, percentile_99 )
select u.dw_id
     , c.dw_id
     , count(u.dw_id)
     , MIN(response_time)
     , MAX(response_time)
     , AVG(response_time)
     , median
     , percentile_95
     , percentile_99
from url_response_times d
   , DimUrl u
   , DimCustomer c
   , (select distinct median(response_time) over (partition by url, customer_name) as median
           , url
           , customer_name
      from url_response_times
      group by response_time, url, customer_name) med
   , (select distinct percentile_95
           , url
           , customer_name
      from (select response_time
                 , percentile_cont(0.95) within group (order by response_time)
                   over (partition by url, customer_name) as percentile_95
                 , url
                 , customer_name
            from url_response_times
            group by response_time, url, customer_name) p) perc_95
   , (select distinct percentile_99
           , url
           , customer_name
      from (select response_time
                 , percentile_cont(0.99) within group (order by response_time)
                   over (partition by url, customer_name) as percentile_99
                 , url
                 , customer_name
            from url_response_times
            group by response_time, url, customer_name) p) perc_99
where d.url = u.url
  and d.customer_name = c.name
  and d.url = med.url
  and d.customer_name = med.customer_name
  and d.url = perc_95.url
  and d.customer_name = perc_95.customer_name
  and d.url = perc_99.url
  and d.customer_name = perc_99.customer_name
group by u.dw_id
       , c.dw_id
       , med.median
       , perc_95.percentile_95
       , perc_99.percentile_99;
Let’s check the contents of the fact table:
select * from factaggregationurl;
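As a quick sanity check (not part of the original load script), we can verify that the fact table holds exactly one row per URL and Customer combination:

-- both counts should match if the aggregation worked as intended
select (select count(*) from FactAggregationUrl) as fact_rows
     , (select count(*)
        from (select distinct url, customer_name from url_response_times) t) as distinct_combinations;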
Great, we have fulfilled all the aggregation goals we set out before.
This was the last post of the actual series on RedShift. The next post will look at the larger picture and discuss RedShift’s place in Big Data. We’ll put RedShift to work in the overall Big Data demo application we started building way back in the first post on Amazon Kinesis.
View all posts related to Amazon Web Services and Big Data here.