Using Amazon DynamoDb for IP and co-ordinate based geo-location services part 1: introduction and goals
April 11, 2015
Introduction
There are a lot of applications out there that involve finding a point on a map. Putting hotels, restaurants, metro stations etc. on a map on your mobile device has become commonplace. Queries that find the nearest hospital, theatre or school need to be executed in a fast and efficient manner.
In this series we’ll discuss a possible solution to the following geo-location related scenarios:
- You have a pair of longitude (lng) and latitude (lat) co-ordinates and you’d like to find all locations in a circle around that point or just the nearest relevant location, e.g. the nearest city
- You have an IP address and you’d like to find the location details of that address, such as New York or Sydney
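To make the second scenario more concrete: an IP lookup usually boils down to checking which range a given address falls into. Below is a minimal sketch in plain Java, with no AWS involved. It assumes that the ranges are written in CIDR notation such as 1.0.0.0/24; the class Ipv4Ranges and its methods are made up for this illustration and are not part of any SDK.

```java
/** Illustrative helper (hypothetical, not part of any SDK): IPv4 text to numbers and ranges. */
class Ipv4Ranges {

    /** Converts dotted IPv4 text such as "1.2.3.4" to its unsigned 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Returns {first, last} numeric addresses covered by a CIDR block (assumed input format). */
    static long[] cidrToRange(String cidr) {
        String[] parts = cidr.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not a CIDR block: " + cidr);
        }
        long base = toLong(parts[0]);
        int prefix = Integer.parseInt(parts[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Prefix out of range in: " + cidr);
        }
        long size = 1L << (32 - prefix);   // number of addresses in the block
        long first = base & ~(size - 1);   // clear the host bits
        long last = first + size - 1;
        return new long[] {first, last};
    }

    /** True if the address lies inside the CIDR block. */
    static boolean contains(String cidr, String ip) {
        long[] range = cidrToRange(cidr);
        long value = toLong(ip);
        return value >= range[0] && value <= range[1];
    }

    public static void main(String[] args) {
        long[] range = cidrToRange("1.0.0.0/24");
        System.out.println("Range: " + range[0] + " - " + range[1]);
        System.out.println("Contains 1.0.0.77: " + contains("1.0.0.0/24", "1.0.0.77"));
    }
}
```

Once addresses are plain numbers, "which range does this IP belong to" becomes an ordinary numeric comparison, which is the kind of question a database can answer easily.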
The series is centred around Amazon cloud based tools. Even if you’re not familiar with the Amazon cloud but are looking for a solution to questions similar to the ones outlined above, I encourage you to read on; you might just find something useful.
The location in our examples will be cities but it could well be anything else that you can put on a map.
There are a number of services nowadays that offer programmatic geo-location lookups. We’ll look at a cloud-based alternative that is based on the Amazon NoSql database called DynamoDb and the AWS (Amazon Web Services) Java SDK.
For the IP lookup we’ll use “normal” DynamoDb search techniques in code. For the lng-lat based circular queries we’ll need something more serious as it involves some complex maths. Fortunately AWS – and Google – have prepared Java libraries for that purpose so we’ll use them. The AWS geo-spatial library is also based on DynamoDb so that’s an extra motivation for us to choose that data source.
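To give a feel for the maths behind a circular query, here is a minimal sketch of the haversine great-circle distance in plain Java, together with a brute-force "within radius" and "nearest" search over an in-memory list. This is not how the AWS or Google libraries work internally; it is a naive linear scan that only illustrates the underlying distance calculation. The class GeoDistance, the nested City type and the sample co-ordinates are made up for this illustration, and the Earth is assumed to be a sphere with a mean radius of 6371 km.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: brute-force great-circle search over an in-memory list. */
class GeoDistance {

    /** Mean Earth radius in kilometres (spherical approximation). */
    static final double EARTH_RADIUS_KM = 6371.0;

    /** A named point on the map; hypothetical type used only in this example. */
    static final class City {
        final String name;
        final double lat;
        final double lng;

        City(String name, double lat, double lng) {
            this.name = name;
            this.lat = lat;
            this.lng = lng;
        }
    }

    /** Haversine distance in kilometres between two lat/lng points given in degrees. */
    static double haversineKm(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        // Clamp to 1.0 to guard against rounding errors before asin
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** Returns all cities whose distance from the centre is at most radiusKm. */
    static List<City> withinRadius(List<City> cities, double lat, double lng, double radiusKm) {
        List<City> hits = new ArrayList<>();
        for (City city : cities) {
            if (haversineKm(lat, lng, city.lat, city.lng) <= radiusKm) {
                hits.add(city);
            }
        }
        return hits;
    }

    /** Returns the closest city to the given point, or null for an empty list. */
    static City nearest(List<City> cities, double lat, double lng) {
        City best = null;
        double bestKm = Double.MAX_VALUE;
        for (City city : cities) {
            double km = haversineKm(lat, lng, city.lat, city.lng);
            if (km < bestKm) {
                bestKm = km;
                best = city;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<City> cities = new ArrayList<>();
        cities.add(new City("London", 51.5074, -0.1278));
        cities.add(new City("Paris", 48.8566, 2.3522));
        cities.add(new City("Sydney", -33.8688, 151.2093));

        double lat = 50.85, lng = 4.35; // a point near Brussels
        System.out.println("Nearest: " + nearest(cities, lat, lng).name);
        for (City city : withinRadius(cities, lat, lng, 400)) {
            System.out.println("Within 400 km: " + city.name);
        }
    }
}
```

A linear scan like this is fine for a handful of points but does not scale to millions of records, which is why a dedicated geo-spatial library backed by a database is worth the effort.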
We’ll go through the entire process in detail. We’ll download and inspect the raw data source, upload it to DynamoDb in various forms and query the database. We’ll write the code examples in Java so some familiarity with that language is also required.
Note that there are web services that offer IP and lng/lat based lookups through their endpoints. The company MaxMind, whose raw geolocation data source files we’ll use for the demo, also offers such a – paid – service. Why would you want to build your own in-house solution then? Here are some advantages:
- Paid web services charge by lookup whereas you can buy the source up front and do whatever you want with it in your application
- They set a limit on the total number of lookups and quite often on the number of lookups per second or minute. An in-house solution has no such limitations
- A web service lookup will take x amount of time and there’s not much you can do about it. On the other hand you can always adjust your DB source in a way that your queries will run faster
Amazon DynamoDb
I have a longer and more detailed series on DynamoDb on this blog starting here. It is in .NET but the Java equivalent is almost identical. The Java and .NET AWS SDKs are often so similar that you can directly translate an AWS C# program into its Java equivalent without consulting any other resource.
DynamoDb is a fast, scalable and efficient NoSql storage that can act as the data store of any application type that can communicate through HTTP. Amazon have prepared a wide range of SDKs and Java is no exception. Therefore any Java application will be able to communicate with DynamoDb: creating and querying tables, adding, deleting and updating records will be straightforward.
We’ll use DynamoDb to store the raw data for our geo-spatial queries.
Preparations part 1: Amazon
You’ll need access to Amazon Web Services in order to try the examples. DynamoDb has a free tier, meaning you can play around in it with some limited data. That’s more than enough for evaluation purposes. This page includes more information on how the free tier works and how to set up an account. By signing up with Amazon and creating a user you’ll also get a pair of security keys: an Amazon Access Key and a Secret Access Key. You can also create users with the Identity and Access Management (IAM) tool of Amazon.
Get accustomed to the DynamoDb GUI. Read at least the first post in the series referenced above to get up and running with the database. It’s really not complex at all, especially if you are familiar with databases in general.
You’ll also need access to and some basic knowledge of Amazon S3. We’ll use it to temporarily store the raw geo-location data in a format that fits DynamoDb for the bulk insertion operation, when we fill the DynamoDb tables with IP and longitude-latitude ranges.
There are two more Amazon components that we’ll briefly mention but not discuss in detail: Data Pipeline, which is an automation mechanism, and Elastic MapReduce (EMR), which is a tool to start a new cluster of EC2 machines that can execute one or more jobs.
It would be very tedious to enter all geolocation data manually into DynamoDb. Keep in mind that it can mean millions of records. So we’ll use the import mechanisms built into DynamoDb which will need a raw data source file stored in S3. The automation mechanism will start a job in Data Pipeline which in turn will start an EMR cluster to carry out the bulk insertion.
Preparations part 2: getting the source data
Several companies offer raw data files with IPs and lng/lat co-ordinate pairs that are linked to physical locations on Earth. As you can imagine they are quite large. They also typically come in two versions: a free version with incomplete data that you can test and a paid one with the full data set.
In our project we use the GeoIP2 CSV files produced and maintained by MaxMind. The package includes both IPv4 and IPv6 source files. The free test version is available here:
Download the ZIP file, which should be called GeoLite2-City-CSV.zip, and save it somewhere. Unzip its contents and you’ll see that it includes a number of CSV files.
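If you’d rather peek into the archive from code before unzipping it, the following minimal sketch lists the CSV entries of a ZIP file using only the standard java.util.zip package. The class name ZipCsvLister is made up for this illustration, and the path to the downloaded file is whatever you pass in as the first argument.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Illustrative helper (hypothetical name): lists the CSV files inside a ZIP archive. */
class ZipCsvLister {

    /** Returns the names of all non-directory entries ending in ".csv" (case-insensitive). */
    static List<String> listCsvEntries(String zipPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (!entry.isDirectory() && name.toLowerCase(Locale.ROOT).endsWith(".csv")) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java ZipCsvLister <path-to-zip>");
            return;
        }
        for (String name : listCsvEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

Run it with the path to the downloaded ZIP file as the only argument to print the names of the CSV files it contains.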
We’ll see how the files are structured and linked in the next post.
View all posts related to Amazon Web Services and Big Data here.