Using Amazon DynamoDb for IP and co-ordinate based geo-location services part 3: IPv4 range strategy
April 18, 2015
Introduction
In the previous post we went through the details of the CSV source files that show the IP and lng/lat ranges and the actual locations. We saw that the two source files are linked by the location ID.
The next task is to import the source into DynamoDb. Recall that we want to handle queries based on IPs and lng/lat pairs separately; those are the primary goals of this series. The way to query an IP database is very different from querying a lng/lat database. An IP will fit into some IP range and we’d like to find that record. If, on the other hand, you have a lng/lat co-ordinate pair and would like to find the nearest city/school/hospital/etc. within a certain radius, then that query will involve some more complex maths.
Strategy
As a result we’ll divide the “Blocks” CSV file into two different tables in DynamoDb: one for the IP range and another one for the lng/lat range.
In this post we’ll go through the strategy to store the IP ranges.
IP range table
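Let’s first consider how we want to save the IP range data. The IP ranges in the source are represented as strings in CIDR format, e.g. 1.0.0.0/24. The /24 suffix is the prefix length: the first 24 bits identify the network and the remaining 8 bits vary within it. The entry 1.0.0.0/24 therefore covers the following range:
From 1.0.0.0 to 1.0.0.255 (or 1.0.0.1 to 1.0.0.254 if you leave out the network and broadcast addresses)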
We’ll see how to convert the CIDR format into lower and upper range IP addresses later on.
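Until then, here is a minimal sketch of what such a conversion could look like. It is not the code used later in this series; the class name CidrRange and its methods are made up for illustration. It treats the address as a 32-bit number, keeps the network bits and then fills the remaining bits with zeros for the lower limit and with ones for the upper limit. Note that it returns the full block, including the network and broadcast addresses at either end.

```java
// Hypothetical helper, for illustration only: turns an IPv4 CIDR string
// into the first and last address of the block.
final class CidrRange {
    final long start; // first address of the block as an unsigned 32-bit value
    final long last;  // last address of the block as an unsigned 32-bit value

    private CidrRange(long start, long last) {
        this.start = start;
        this.last = last;
    }

    static CidrRange parse(String cidr) {
        String[] parts = cidr.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not a CIDR block: " + cidr);
        }
        int prefix = Integer.parseInt(parts[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Invalid prefix length: " + prefix);
        }
        long address = toLong(parts[0]);
        // The top 'prefix' bits are set in the mask, the rest are zero.
        long mask = (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        long start = address & mask;              // host bits all zero
        long last = start | (~mask & 0xFFFFFFFFL); // host bits all one
        return new CidrRange(start, last);
    }

    // Packs the four octets of a dotted IPv4 address into one long.
    static long toLong(String ip) {
        String[] octets = ip.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String octet : octets) {
            int part = Integer.parseInt(octet);
            if (part < 0 || part > 255) {
                throw new IllegalArgumentException("Invalid octet: " + octet);
            }
            value = (value << 8) | part;
        }
        return value;
    }

    // Unpacks a long back into the dotted notation.
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }
}
```

With this sketch, CidrRange.parse("1.0.0.0/24") gives a start of 1.0.0.0 and a last of 1.0.0.255 once passed through toDotted.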
Say you get the IP address 1.0.0.22. You could do some unwieldy string manipulation query and find that it lies within that range. However, that query would be extremely inefficient. Keep in mind that the database will have about 10 million rows. It would take a considerable amount of time to find the matching row even in a fast, full-blown relational database like MySQL. In DynamoDb you also face some extra limitations, such as read throughput and the upper limit on how much a single query can read (1 MB of data, which works out to about 12000 rows here). Therefore we need to come up with a query strategy that scans as few rows as possible and doesn’t involve strings.
Luckily for us this is nothing new and there’s a well-tested way of achieving this. The IPs need to be turned into their decimal representations. You can read the details of what that means and how it is achieved in this post.
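As a quick worked example of what the decimal representation means: each of the four octets is treated as one digit of a base-256 number. For the sample address 1.0.0.22 used above, that gives:

```latex
\[
1.0.0.22 \;\mapsto\; 1 \cdot 256^{3} + 0 \cdot 256^{2} + 0 \cdot 256^{1} + 22 \cdot 256^{0}
= 16777216 + 22 = 16777238
\]
```

By the same arithmetic the largest possible address, 255.255.255.255, maps to 4294967295, which is too large for a signed 32-bit integer.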
So the idea is that we turn the IP ranges into lower and upper limit integers – or longs to be exact. Furthermore we’ll take the first element in the IP range, e.g. “127” of 127.0.0.1 and use it as part of a composite key in a DynamoDb IP range table. The other element in the composite key will be the lower limit of the IP range. Here’s an extract from our own IP range table to help you visualise what I mean:
network_head is the first octet of the IP range and will act as the primary hash key. network_start_integer represents the lower limit of the IP range as a decimal and will act as the primary range key in our DynamoDb table. If you don’t know what is meant by those key types then read this introduction on this blog.
You’ll recognise the geoname_id column. network_last_integer is the decimal representation of the upper limit of the IP range.
Hence our IP query can be based on integers only. We’ll see exactly how it’s done using the AWS Java SDK later on.
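To make the idea more concrete before we get to the SDK, here is a small in-memory sketch of the lookup logic. It doesn’t talk to DynamoDb at all: a HashMap stands in for the hash key and a TreeMap for the sorted range key. The class name IpRangeTableSketch and the geoname_id values in the usage note are made up for illustration, and the real query may well look different. The idea is to pick the partition by the first octet, take the row with the largest network_start_integer that is not greater than the IP, and then check that the IP is not above network_last_integer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the IP range table; illustrative only.
final class IpRangeTableSketch {

    // One stored item. Field names mirror the table columns described above.
    static final class Row {
        final long networkStartInteger;
        final long networkLastInteger;
        final long geonameId;

        Row(long networkStartInteger, long networkLastInteger, long geonameId) {
            this.networkStartInteger = networkStartInteger;
            this.networkLastInteger = networkLastInteger;
            this.geonameId = geonameId;
        }
    }

    // network_head (hash key) -> rows sorted by network_start_integer (range key).
    private final Map<Integer, TreeMap<Long, Row>> partitions = new HashMap<>();

    // Assumption: no stored range spans more than one first octet.
    // Wider blocks would have to be split per octet before being stored.
    void put(long start, long last, long geonameId) {
        int head = (int) (start >> 24);
        partitions.computeIfAbsent(head, h -> new TreeMap<>())
                  .put(start, new Row(start, last, geonameId));
    }

    // Returns the geoname_id of the range containing the IP, or null if none does.
    Long find(long ip) {
        TreeMap<Long, Row> partition = partitions.get((int) (ip >> 24));
        if (partition == null) {
            return null;
        }
        // Closest range start at or below the IP within this partition.
        Map.Entry<Long, Row> candidate = partition.floorEntry(ip);
        if (candidate == null || candidate.getValue().networkLastInteger < ip) {
            return null; // the IP falls into a gap between stored ranges
        }
        return candidate.getValue().geonameId;
    }
}
```

If you store 1.0.0.0/24 as put(16777216L, 16777471L, 111L), then find(16777238L), i.e. 1.0.0.22, returns 111, whereas find(16777728L), i.e. 1.0.2.0, returns null because it lies above the upper limit of the closest range. Only one row per lookup needs to be examined, and no strings are involved.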
We’ll take a look at our strategy for the lng/lat range in the next part.