Using Amazon DynamoDb for IP and co-ordinate based geo-location services part 8: creating the lng/lat coordinates source file for DynamoDb

Introduction

In the previous post we successfully queried the limited IPv4 range table in DynamoDb and found the geoname ID that belongs to a single IP. We used three integer properties of the table to narrow down the number of records that had to be scanned, which reduced both the query execution time and the risk of exceptions.

In this post we’ll start the same process for the lng/lat coordinate range. More specifically we’ll prepare the raw data file that can be uploaded into DynamoDb through S3. The process will be very similar to what we saw in this post where we created the IPv4 range source file. It’s a good idea to quickly re-read that post to refresh your memory of the process.

Preparation

We’ll go with the same reduced Blocks-IPv4 CSV file we created in the post referred to above. Here’s a reminder:

I’ll go with the following range in the sample:

From the first record…

1.0.0.0/24,2077456,2077456,,0,0,,-27.0000,133.0000

…until the end of the 1.0.x range i.e….

1.0.255.0/24,1151254,1605651,,0,0,83110,7.9833,98.3667

The source file gives 275 records at the time of writing this post. I saved the file as IPv4-range-sample.csv.
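
If you no longer have the reduced file at hand, the filtering can be sketched roughly as follows. This is only an illustration: it assumes that the first line of the file is a header, that every data row starts with the network in CIDR notation and that the 1.0.x range can be recognised by the "1.0." prefix. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class Ipv4SampleFilter {
    // Keeps the header (assumed to be the first line) plus every row
    // whose first column, the network, starts with the given prefix.
    static List<String> keepPrefix(List<String> csvLines, String networkPrefix) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < csvLines.size(); i++) {
            String line = csvLines.get(i);
            if (i == 0 || line.startsWith(networkPrefix)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Calling Ipv4SampleFilter.keepPrefix(allLines, "1.0.") on the lines of the full Blocks-IPv4 file should then leave the header and the 1.0.x rows only.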

Downloading the necessary libraries

Amazon have a demo library to demonstrate the lng-lat based geo-services through DynamoDb and we’ll reuse a lot of ideas from there. I’ve cloned the project from GitHub. There’s at least one very specific reason to do so. It contains two libraries that for some reason are not available in the Maven repository – at least I’ve been unable to locate them. They are located in the /lib folder:

Necessary libraries for AWS lnglat geoservices

Each folder includes a JAR file. If you’re working with a Maven project you’ll need to save these in your Maven repository manually. If you’re not sure how to do it, here’s a very short guide.

Furthermore, there’s an additional library called dynamodb-geo-1.0.0.jar available in the root folder:

DynamoDb geo jar in root folder

Insert that library in the Maven repository as well.

Preparing the Maven project

Before we can transform the MaxMind source file into a DynamoDb-friendly import file we need to get the JAR dependencies for the Maven project. You can install the AWS Java SDK right away, although we’ll only use it later. You might even have it already from the previous posts.

Here’s the list of dependencies we’ll need for the lng/lat transformation and query process:

<dependencies>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk</artifactId>
        <version>1.9.9</version>
    </dependency>
    <dependency>
        <groupId>com.google.common.geometry</groupId>
        <artifactId>s2-geometry-java</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.google.common</groupId>
        <artifactId>guava-r09</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws.geo</groupId>
        <artifactId>dynamodb-geo</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-mapper-asl</artifactId>
        <version>1.9.13</version>
    </dependency>
    <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-core-asl</artifactId>
        <version>1.9.13</version>
    </dependency>    
</dependencies>

Transforming the IPv4 Block file

Actually we won’t transform the source file but rather read the necessary elements from it and create an input file for DynamoDb ready to be imported from S3. Here’s a short description of the steps we’re going to take:

  • Create the source file for DynamoDb based on the reduced IPv4 sample
  • Upload it to S3
  • Import it from S3 to DynamoDb using the built-in bulk insertion tool in DynamoDb

Before I present any code let’s see step by step what it will need to carry out:

  • Read the reduced CSV source file line by line
  • Extract the geoname, longitude and latitude values
  • Keep track of every geoname, lng and lat combination we have already seen so that duplicates can be skipped. A single city can cover a long range of IPs and we don’t want to store the same lng/lat combination over and over again in the database. Also, the DynamoDb import process will fail if it finds two identical records. The code also omits proxy and satellite locations, for which the source file has no longitude and latitude data
  • If the combination is unique then we build a single row in the DynamoDb import file. We use the libraries referenced above to build a geo-hash and a hash key for our table. The range key will simply be the record counter. I went for the easiest option, but anything that is unique and can be converted into a string will do. The geo-hash is the result of a long and complex mathematical calculation that Google implemented in the S2 geometry library. It will be a large number that uniquely represents a coordinate pair
  • Use all those elements to build a string that can be appended to a DynamoDb-formatted JSON file. DynamoDb cannot just be fed any textual source file. The file needs to clearly show the boundaries of each data record and the type of each field. In this example we have numeric and string fields denoted by “n” and “s”. The individual elements must be delimited by end-of-text and start-of-text characters, i.e. 0x03 and 0x02
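
To make the relationship between the geo-hash and the hash key a bit more tangible: as far as I can tell from the dynamodb-geo sources, the hash key is simply the leading digits of the geo-hash, and the second argument of generateHashKey is the number of digits to keep. The sketch below reimplements that idea without the library, so treat it as an illustration under that assumption and not as the library’s actual code. The class name, the method name and the sample number are made up.

```java
final class HashKeySketch {
    // Returns the first `digits` decimal digits of the geo-hash.
    // A leading minus sign is not counted as a digit (assumption for this sketch).
    static long leadingDigits(long geoHash, int digits) {
        if (digits <= 0) {
            throw new IllegalArgumentException("digits must be positive");
        }
        String text = Long.toString(geoHash);
        int length = geoHash < 0 ? text.length() - 1 : text.length();
        if (length <= digits) {
            return geoHash; // nothing to cut off
        }
        long divisor = 1;
        for (int i = 0; i < length - digits; i++) {
            divisor *= 10;
        }
        return geoHash / divisor; // integer division drops the trailing digits
    }

    public static void main(String[] args) {
        // A made-up 19-digit number standing in for a real geo-hash
        System.out.println(leadingDigits(5221366118452580119L, 6)); // prints 522136
    }
}
```

All points whose geo-hashes share the same leading digits therefore end up under the same hash key, which is what lets a query narrow down the candidate rows by hash key first.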

Insert the following code block into your project:

// Imports needed at the top of the class file
import com.amazonaws.geo.model.GeoPoint;
import com.amazonaws.geo.s2.internal.S2Manager;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

private static void produceGeoLatLngSourceFileForS3LineByLine()
{
    String sourceFileFullPath = "C:\\path-to-reduced-maxmind-datasource\\IPv4-range-sample.csv";
    String targetFileFullPath = "C:\\path-to-reduced-maxmind-datasource\\GeoIP2-City-Blocks-LngLat-S3-ImportSourceFile.json";
    char endOfTextCharacter = 0x03;   // separates an attribute name from its value
    char startOfTextCharacter = 0x02; // separates two attributes

    int uniqueRecordCounter = 0;
    try
    {
        File targetFile = new File(targetFileFullPath);
        // A hash set gives fast duplicate checks even for large source files
        Set<String> uniqueKeyContainer = new HashSet<>();
        if (!targetFile.exists())
        {
            boolean createNewFile = targetFile.createNewFile();
            if (!createNewFile)
            {
                throw new IOException("Could not create target file");
            }
        }
        String line;
        int lineCounter = 0;

        // Both the reader and the writer are closed automatically.
        // The writer appends, so delete the target file before re-running the method.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(sourceFileFullPath), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(targetFile, true))))
        {
            while ((line = br.readLine()) != null)
            {
                // Skip the header row
                if (lineCounter == 0)
                {
                    lineCounter++;
                    continue;
                }
                lineCounter++;

                try
                {
                    // The -1 limit keeps trailing empty columns, e.g. missing coordinates
                    String[] columns = line.split(",", -1);

                    String geonameId = columns[1];
                    String lat = columns[7];
                    String lng = columns[8];

                    if (lat.isEmpty() || lng.isEmpty() || geonameId.isEmpty())
                    {
                        System.out.println("Found proxy or satellite");
                    } else
                    {
                        String key = geonameId.concat("|").concat(lng).concat("|").concat(lat);
                        // add() returns false if the key is already in the set
                        if (uniqueKeyContainer.add(key))
                        {
                            uniqueRecordCounter++;
                            double latitude = Double.parseDouble(lat);
                            double longitude = Double.parseDouble(lng);
                            String locationId = geonameId;
                            GeoPoint geoPoint = new GeoPoint(latitude, longitude);
                            long geoHash = S2Manager.generateGeohash(geoPoint);
                            long hashKey = S2Manager.generateHashKey(geoHash, 6);
                            // The point as a JSON string, wrapped in a DynamoDb string ("s") attribute
                            StringBuilder geoJsonBuilder = new StringBuilder();
                            geoJsonBuilder.append("{\"s\":\"{\\\"type\\\":\\\"Point\\\",\\\"coordinates\\\":[");
                            geoJsonBuilder.append(latitude).append(",").append(longitude).append("]}\"}");
                            // One import row: name, 0x03, typed value, 0x02, next name and so on
                            StringBuilder rowBuilder = new StringBuilder();
                            rowBuilder.append("rangeKey").append(endOfTextCharacter)
                                    .append("{\"s\":\"").append(uniqueRecordCounter).append("\"}")
                                    .append(startOfTextCharacter).append("geoJson").append(endOfTextCharacter)
                                    .append(geoJsonBuilder.toString())
                                    .append(startOfTextCharacter).append("hashKey").append(endOfTextCharacter)
                                    .append("{\"n\":\"").append(hashKey).append("\"}")
                                    .append(startOfTextCharacter).append("geoname_id").append(endOfTextCharacter)
                                    .append("{\"n\":\"").append(locationId).append("\"}")
                                    .append(startOfTextCharacter).append("geohash").append(endOfTextCharacter)
                                    .append("{\"n\":\"").append(geoHash).append("\"}")
                                    .append(System.lineSeparator());

                            out.print(rowBuilder.toString());
                        }
                    }
                } catch (Exception ex)
                {
                    // A malformed row is reported and skipped, the rest of the file is still processed
                    System.out.println(ex.getMessage());
                }
            }
        } catch (IOException ex)
        {
            System.out.println(ex.getMessage());
        }

    } catch (IOException ex)
    {
        System.out.println(ex.getMessage());
    }

    System.out.println("File creation done. Number of unique records: ".concat(Integer.toString(uniqueRecordCounter)));
}

It’s probably best to run the method in Debug mode and step through it to see exactly what happens. If everything goes well you should have a .json file which looks as follows when opened in Notepad++:

DynamoDb import file for lng-lat range
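
In case the screenshot doesn’t come through: each line of the file holds one record. An attribute name is followed by the end-of-text character (0x03) and its typed JSON value, and consecutive attributes are separated by the start-of-text character (0x02). The following sketch builds such a line from an ordered map and makes the control characters visible when printing. The class, the method names and the two sample attributes are made up for illustration and only mirror the layout described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImportRowSketch {
    private static final char STX = 0x02; // start-of-text: separates two attributes
    private static final char ETX = 0x03; // end-of-text: separates a name from its value

    // Joins the entries in insertion order: name ETX value STX name ETX value ...
    static String buildRow(Map<String, String> typedJsonByAttribute) {
        StringBuilder row = new StringBuilder();
        for (Map.Entry<String, String> entry : typedJsonByAttribute.entrySet()) {
            if (row.length() > 0) {
                row.append(STX);
            }
            row.append(entry.getKey()).append(ETX).append(entry.getValue());
        }
        return row.toString();
    }

    // Replaces the invisible control characters with readable markers.
    static String visualise(String row) {
        return row.replace(String.valueOf(STX), "<STX>")
                  .replace(String.valueOf(ETX), "<ETX>");
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("rangeKey", "{\"s\":\"1\"}");
        attributes.put("geoname_id", "{\"n\":\"2077456\"}");
        // prints rangeKey<ETX>{"s":"1"}<STX>geoname_id<ETX>{"n":"2077456"}
        System.out.println(visualise(buildRow(attributes)));
    }
}
```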

Note that the unique record counter stopped at 42 for the reduced IPv4 range data source I selected above. So out of the first 275 rows in the current MaxMind CSV file 233 were filtered out as duplicates. You can expect the lng/lat range table to be much smaller than the IPv4 range table after all duplicates have been removed. In our real-life case we have just over 134k records in our lng/lat range table compared to slightly above 10 million rows in the full IPv4 range table.
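
If you want to double-check the number of unique locations without producing the import file, the counting can be isolated in a few lines. This is a minimal sketch with made-up class and method names. It assumes the same column positions as above: geoname ID in column 1, latitude in column 7 and longitude in column 8, with the header row already removed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueLocationCounter {
    // Counts the distinct geoname|lng|lat keys in the given data rows.
    // Rows without a geoname ID or coordinates are skipped.
    static int countUnique(List<String> csvRows) {
        Set<String> seen = new HashSet<>();
        for (String row : csvRows) {
            String[] columns = row.split(",", -1); // -1 keeps trailing empty columns
            if (columns.length < 9 || columns[1].isEmpty()
                    || columns[7].isEmpty() || columns[8].isEmpty()) {
                continue;
            }
            seen.add(columns[1] + "|" + columns[8] + "|" + columns[7]);
        }
        return seen.size();
    }
}
```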

Upload the file to a folder in S3. Make sure that this input file is the only object in that folder. You can also create an empty folder called “logs” right away, where the DynamoDb import process will send its log messages.

We’ll see how to upload these records into DynamoDb in the next post.

View all posts related to Amazon Web Services and Big Data here.

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.
