Big Data: a summary of various Amazon Big Data tools
December 17, 2016
We have gone through a lot of material about Big Data on this blog. This post summarises the Amazon cloud components one by one: what they do and what role they play in a Big Data architecture.
Amazon Kinesis is a highly scalable cloud-based messaging system that can handle extremely large volumes of messages. Its main purpose in a cloud-based Big Data architecture is to accept and temporarily store messages coming from a variety of sources: smart devices, websites, cars, applications, i.e. anything that can be connected to the Internet and emit data such as response times, stock prices, user behaviour or energy consumption. Kinesis can store any type of unstructured message, and by default a message is kept for 24 hours before it is deleted.
Kinesis therefore provides the entry point into an AWS Big Data architecture.
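To make the message-distribution side of Kinesis concrete, here is a minimal Python sketch of how a record's partition key determines which shard of a stream receives it: Kinesis hashes the partition key with MD5 and matches the resulting 128-bit value against each shard's hash-key range. The sketch assumes evenly split shards (real shards carry explicit ranges), and the message fields are made-up examples; the payload itself is an opaque blob to Kinesis, so JSON, CSV or binary all work equally well.

```python
import hashlib
import json

def shard_for_partition_key(partition_key, shard_count):
    """Approximate Kinesis shard selection: MD5-hash the partition
    key, read it as a 128-bit integer and find the shard whose
    hash-key range contains it. With evenly split shards this
    reduces to dividing the 128-bit space into equal slices."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = (2 ** 128) // shard_count
    return min(hash_value // range_size, shard_count - 1)

# A raw message as a device might send it (hypothetical fields):
message = json.dumps({"device_id": "web-server-01", "response_time_ms": 132})

# Records with the same partition key always land on the same shard,
# which preserves their ordering relative to each other.
shard = shard_for_partition_key("web-server-01", 2)
```

Using the device id as the partition key keeps each device's messages ordered; using a random key spreads the load evenly across shards instead.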
S3 provides a secure, durable and highly-scalable object storage solution. You can use S3 to store just about any type of file: images, text files, videos, JAR files, HTML pages etc. All files are stored in a key-value map, i.e. each file has a key where the file itself is the value. For Big Data the messages will most likely be stored as text files organised into various folders so that both humans and applications can find and consume the messages relatively easily.
S3 therefore provides permanent storage for the raw data coming from Kinesis. There must be an application that extracts the messages from Kinesis and stores them in S3; we saw a .NET-based solution in the series on S3.
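The "folders" mentioned above are really just a key-naming convention, since S3 is a flat key-value store. Here is a small sketch of building date-based keys for raw messages; the `raw-data` prefix and file layout are made-up examples, not anything the AWS SDK prescribes.

```python
from datetime import datetime, timezone

def raw_message_key(source, received_at, sequence):
    """Build an S3 object key that groups raw messages into
    date-based 'folders'. S3 has no real directories: the slashes
    are simply part of the key, but both the AWS console and the
    prefix-based listing APIs treat them as a hierarchy."""
    return "raw-data/{source}/{date}/{seq}.txt".format(
        source=source,
        date=received_at.strftime("%Y/%m/%d"),
        seq=sequence)

key = raw_message_key("kinesis", datetime(2016, 12, 17, tzinfo=timezone.utc), 42)
# key == "raw-data/kinesis/2016/12/17/42.txt"
```

A consistent scheme like this lets both humans browsing the console and aggregation jobs scanning a date prefix find the relevant messages easily.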
DynamoDb is Amazon’s take on NoSQL databases. It is a fast, scalable and efficient NoSQL store that can act as the data store of any application that can communicate over HTTP. NoSQL databases normally don’t enforce a schema, so you can potentially store very different records in the same table.
As DynamoDb is a storage mechanism it can potentially be used in any storage-related scenario within a Big Data architecture. One such example is raw data storage, i.e. a place to store the incoming raw data messages from the devices that send their data to your system.
However, DynamoDb was not exactly designed to handle a large influx of messages within a short period of time. You would need to increase the write throughput to a very large number, which drives up your Amazon bill; otherwise you would start getting exceptions because the provisioned throughput limit has been exceeded, and some of your messages could be lost.
Instead, DynamoDb can be used in other ways, such as storing data aggregations: tabular data from Elastic MapReduce can easily be exported to DynamoDb tables. DynamoDb is also useful for storing state data. For example, we used DynamoDb tables to save the state of the aggregation and Kinesis consumer applications. The state tables give these applications a way to communicate, as they are otherwise completely separate.
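As a concrete illustration of such a state record, here is a sketch of an item in DynamoDb's low-level attribute-value format, where every value is wrapped in a type descriptor. The table layout and attribute names are made-up examples, not the actual schema from the earlier series.

```python
import time

def build_state_item(application_name, last_processed_key):
    """Shape a state record in DynamoDb's low-level attribute-value
    format: 'S' marks a string and 'N' a number (numbers are
    transmitted as strings over the wire)."""
    return {
        "ApplicationName": {"S": application_name},      # hash key
        "LastProcessedKey": {"S": last_processed_key},
        "UpdatedAtEpoch": {"N": str(int(time.time()))},
    }

# The Kinesis consumer records how far it got; the aggregation job
# later reads the same item, so the two applications never need to
# talk to each other directly.
item = build_state_item("kinesis-consumer", "raw-data/2016/12/17/42.txt")
```

A dict of this shape is what the low-level `PutItem` API expects as the item body; higher-level SDK wrappers hide the type descriptors behind ordinary objects.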
EMR is Amazon’s web service with the Hadoop framework installed. Hadoop is widely used in distributed computing, where large amounts of potentially unstructured input files are stored across several servers. Technologies built on top of Hadoop, such as Hive and Pig, enable you to run aggregation jobs on the raw data.
EMR is easily integrated with S3 and DynamoDb. It provides a highly scalable and stable mechanism to extract some useful statistics from the raw data saved in S3.
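To show what "extracting statistics from the raw data" means in practice, here is the kind of aggregation a Hive query such as `SELECT source, AVG(response_time) FROM raw GROUP BY source` would compute on EMR, expressed in plain Python over tab-separated lines. The column layout is a made-up example of how the raw S3 files might look.

```python
from collections import defaultdict

def average_by_source(lines):
    """Group tab-separated 'source<TAB>value' lines by source and
    compute the average value per group, the way a Hive GROUP BY
    with AVG() would, except on one machine instead of a cluster."""
    totals = defaultdict(lambda: [0.0, 0])   # source -> [sum, count]
    for line in lines:
        source, value = line.split("\t")
        totals[source][0] += float(value)
        totals[source][1] += 1
    return {source: total / count for source, (total, count) in totals.items()}

raw = ["web\t100", "web\t200", "mobile\t50"]
averages = average_by_source(raw)
# averages == {"web": 150.0, "mobile": 50.0}
```

The point of EMR is that Hive distributes exactly this kind of computation across many nodes, so the same query scales to raw data far larger than one machine's memory.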
Amazon RedShift is Amazon’s data warehousing solution and is especially well suited for Big Data scenarios where petabytes of data must be stored and analysed. It follows a columnar DBMS architecture and was designed specifically for heavy analytical, data-mining style queries.
The role of RedShift in an AWS Big Data architecture is similar to that of EMR. As it can easily load data from S3, it is a powerful tool for running aggregations over large amounts of information.
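The S3 loading mentioned above is typically done with RedShift's COPY command. Here is a small sketch that assembles such a statement; the table name, bucket, prefix and IAM role ARN are all placeholders, and the delimiter/compression options are just one plausible configuration for tab-separated, gzipped raw files.

```python
def build_copy_statement(table, bucket, prefix, iam_role_arn):
    """Assemble a RedShift COPY command that bulk-loads delimited
    files from an S3 prefix into a table. All identifiers here are
    hypothetical placeholders."""
    return (
        "COPY {table} "
        "FROM 's3://{bucket}/{prefix}' "
        "CREDENTIALS 'aws_iam_role={role}' "
        "DELIMITER '\\t' GZIP;"
    ).format(table=table, bucket=bucket, prefix=prefix, role=iam_role_arn)

statement = build_copy_statement(
    "raw_messages", "my-data-bucket", "raw-data/2016/12/17/",
    "arn:aws:iam::123456789012:role/redshift-load")
```

COPY reads all objects under the prefix in parallel across the cluster's nodes, which is why it is dramatically faster than inserting the rows one by one.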
Here’s how much space we’ve spent on AWS Big Data on this blog so far:
- Kinesis: 6 posts
- S3: 6 posts
- DynamoDb: 7 posts
- Elastic MapReduce: 8 posts
- RedShift: 10 posts
That’s 37 posts in total. If we also consider another series on AWS Big Data, which goes through the components from a more architectural point of view, with its 5 posts, then we come to 42 posts. At roughly 1000 words per post, that is around 42000 words on this topic. That sounds like a good starting point for a book…
We’ll leave Amazon Big Data behind for a while and start something completely different in the next series: we’ll revisit the SOLID software principles with new examples.
View all posts related to Amazon Web Services and Big Data here.