Using Amazon DynamoDb for IP and co-ordinate based geo-location services part 2: MaxMind source files
April 12, 2015 2 Comments
Introduction
In the previous post we outlined the goals of this series and the tools that we’re going to use. We’ve got as far as downloading MaxMind’s free version of their geo-location source with IPs and longitude-latitude co-ordinates. We saw that the downloaded package had a number of CSV files.
In this post we’ll start off by looking at the structure of those files and how they are connected.
The source files
At the time of writing this post the ZIP package was just below 30MB in size and contained the files visible in the screenshot below. Unzip it, which will be a lengthy process due to the large files inside.
Those are quite large but are still smaller than the 546MB of the IPv4 “Blocks” CSV file of the paid version. If you already have an application designed for opening large files then I recommend you use that. If not then you can go for the preview version of EmEditor available here. It can effortlessly open large files in chunks.
In this series we’ll concentrate on IPv4 and lng/lat co-ordinates and ignore IPv6. Open the the file called GeoLite2-City-Blocks-IPv4.csv:
The file contains millions of rows like that.
Let’s look at the columns:
- network: the IP address ranges in CIDR format. We’ll look into this in more details later on in the series
- geoname_id: the location ID where the IP belongs – we’ll see soon where this ID is stored
- registered_country_geoname_id: location ID where the IP is registered
- represented_country_geoname_id: location ID which the IP represents
- is_anonymous_proxy,is_satellite_provider: whether the IP is a proxy or belongs to a satellite
- postal_code,latitude,longitude: you probably understand what these mean
registered_country_geoname_id may look a bit mysterious but they are strongly related to the proxy and satellite provider flags. Here’s an example of a proxy in the CSV file:
2.62.9.177/32,,2017370,,1,0,,,
This IP doesn’t point to a “real” physical location. You’ll see that the “geoname_id” and the lng-lat co-ordinates are empty. However, we know that the proxy was registered in ID 2017370. A satellite location is similar:
5.11.17.0/24,,2635167,,0,1,,,
I copy the documentation of the different “country” definition from MaxMinds DEV page:
“We now distinguish between several types of country data. The country is the country where the IP address is located. The registered_country is the country in which the IP is registered. These two may differ in some cases.
Finally, we also include a represented_country key for some records. This is used when the IP address belongs to something like a military base. The represented_country is the country that the base represents. This can be useful for managing content licensing, among other uses.”
So where do these IDs point to? The answer lies in the other CSV files that come in different languages which you can see in the file names: “de”, “fr” etc. We’ll concentrate on the English version so open the one called GeoLite2-City-Locations-en.csv:
That’s the ID that the “Blocks” file is pointing at. The file contains more columns but my screen is not wide enough to show all of them. View all columns in the source to see what’s available: continent, country, region, city etc. You’ll see that in some places the source is incomplete. That’s the “price” you pay for the free version but it’s great for evaluation and code testing.
Our next goal is to put all that into DynamoDb. However, we cannot just copy-paste these files into DynamoDb tables. We’ll need to import the records in a special format that they fit the queries we’ll be executing and the libraries that we’ll use.
We’ll consider our strategy in the next post.
View all posts related to Amazon Web Services and Big Data here.
Reblogged this on Dinesh Ram Kali..
Any suggestion about how to store IPv6 Cidr values as well? Storing startip and endip for an IPv6 network in the dynamodb table could be problematic.