Using Amazon DynamoDb for IP and co-ordinate based geo-location services part 8: creating the lng/lat coordinates source file for DynamoDb
May 3, 2015
Introduction
In the previous post we successfully queried the limited IPv4 range table in DynamoDb and found the geoname ID that belongs to a single IP. We used three available integer properties in the table to narrow down the number of records that had to be scanned, reducing both the query execution time and the risk of exceptions.
In this post we’ll start the same process for the lng/lat coordinate range. More specifically, we’ll prepare the raw data file that can be uploaded into DynamoDb through S3. The process will be very similar to what we saw in this post where we created the IPv4 range source file, so it’s a good idea to quickly re-scan that post to remind yourself of the process.
Preparation
We’ll go with the same reduced Blocks-IPv4 CSV file we created in the post referred to above. Here’s a reminder:
I’ll go with the following range in the sample:
From the first record…
1.0.0.0/24,2077456,2077456,,0,0,,-27.0000,133.0000
…until the end of the 1.0.x range i.e….
1.0.255.0/24,1151254,1605651,,0,0,83110,7.9833,98.3667
The source file gives 275 records at the time of writing this post. I saved the file as IPv4-range-sample.csv.
Downloading the necessary libraries
Amazon has a demo library that demonstrates lng/lat-based geo-services through DynamoDb, and we’ll reuse a lot of ideas from it. I’ve cloned the project from GitHub. There’s at least one very specific reason to do so: the project contains two libraries that for some reason are not available in the Maven repository – at least I’ve been unable to locate them. They sit in the /lib folder.
Each folder includes a JAR file. If you’re working with a Maven project you’ll need to save these in your local Maven repository manually. If you’re not sure how to do it, here’s a very short guide.
Furthermore, there’s an additional library called dynamo-geo-1.0.0.jar available in the root folder.
Insert that library into the Maven repository as well; a sketch of the install commands follows below.
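If it helps, here’s what the manual installation boils down to, using Maven’s install-file goal. The group and artifact IDs are taken from the dependency list in the next section; the JAR paths under /lib are assumptions based on my clone, so adjust them to whatever you see on disk:

mvn install:install-file -Dfile=dynamo-geo-1.0.0.jar -DgroupId=com.amazonaws.geo -DartifactId=dynamodb-geo -Dversion=1.0.0 -Dpackaging=jar
mvn install:install-file -Dfile=lib\s2-geometry-java\s2-geometry-java-1.0.0.jar -DgroupId=com.google.common.geometry -DartifactId=s2-geometry-java -Dversion=1.0.0 -Dpackaging=jar
mvn install:install-file -Dfile=lib\guava-r09\guava-r09.jar -DgroupId=com.google.common -DartifactId=guava-r09 -Dversion=1.0.0 -Dpackaging=jar

Run the commands from the folder where you cloned the project.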
Preparing the Maven project
Before we can transform the MaxMind source file into a DynamoDb-friendly import file we need to get the JAR dependencies for the Maven project. You can install the AWS Java SDK already at this point, although we’ll only use it later – you might even have it from the previous posts.
Here’s the list of dependencies we’ll need for the lng/lat transformation and query process:
<dependencies>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk</artifactId>
        <version>1.9.9</version>
    </dependency>
    <dependency>
        <groupId>com.google.common.geometry</groupId>
        <artifactId>s2-geometry-java</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.google.common</groupId>
        <artifactId>guava-r09</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws.geo</groupId>
        <artifactId>dynamodb-geo</artifactId>
        <version>1.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-mapper-asl</artifactId>
        <version>1.9.13</version>
    </dependency>
    <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-core-asl</artifactId>
        <version>1.9.13</version>
    </dependency>
</dependencies>
Transforming the IPv4 Block file
Actually we won’t transform the source file but rather read the necessary elements from it and create an input file for DynamoDb ready to be imported from S3. Here’s a short description of the steps we’re going to take:
- Create the source file for DynamoDb based on the reduced IPv4 sample
- Upload it to S3
- Import it from S3 to DynamoDb using the built-in bulk insertion tool in DynamoDb
Before I present any code let’s see step by step what it will need to carry out:
- Read the reduced CSV source file line by line
- Extract the geoname, longitude and latitude values
- Save each geoname, lng and lat combination in a list to avoid duplicates. We don’t want to store the same lng/lat combination over and over again: a single city can have a long range of IPs, and the DynamoDb import process will fail if it finds two identical records. The code will also omit proxy and satellite locations, where the source file has no longitude and latitude data
- If the combination is unique then we build a single row in the DynamoDb import file. We use the libraries referenced above to build a geo-hash and a hash key for our table. The range key will simply be the record counter – as long as it’s unique and can be converted into a string you’ll be fine; I went for the easiest option. The geo-hash will be the result of a long and complex mathematical calculation that Google implemented in the S2 geometry library: a large number that uniquely represents a coordinate pair
- Use all those elements to build a string that can be appended to a DynamoDb-formatted JSON file. DynamoDb cannot be fed just any textual source file: the file must clearly mark the boundaries of each data record and the type of each attribute. In this example we have numeric and string fields, denoted by “n” and “s”, and the individual elements must be delimited by end-of-text (0x03) and start-of-text (0x02) control characters
Insert the following code block into your project:
// Imports needed at the top of the hosting class:
// import com.amazonaws.geo.model.GeoPoint;
// import com.amazonaws.geo.s2.internal.S2Manager;
// import java.io.*;
// import java.nio.charset.Charset;
// import java.util.LinkedList;
// import java.util.List;

private static void produceGeoLatLngSourceFileForS3LineByLine() {
    String sourceFileFullPath = "C:\\path-to-reduced-maxmind-datasource\\IPv4-range-sample.csv";
    String targetFileFullPath = "C:\\path-to-reduced-maxmind-datasource\\GeoIP2-City-Blocks-LngLat-S3-ImportSourceFile.json";
    // attribute names and values are separated by ETX (0x03), attributes by STX (0x02)
    char endOfTextCharacter = 0x03;
    char startOfTextCharacter = 0x02;
    int uniqueRecordCounter = 0;
    try {
        File targetFile = new File(targetFileFullPath);
        // linear lookup is fine for this small sample
        List<String> uniqueKeyContainer = new LinkedList<>();
        if (!targetFile.exists()) {
            boolean createNewFile = targetFile.createNewFile();
            if (!createNewFile) {
                throw new IOException("Could not create target file");
            }
        }
        String line;
        int lineCounter = 0;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(sourceFileFullPath), Charset.forName("UTF-8")));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(targetFile, true)))) {
            while ((line = br.readLine()) != null) {
                // skip the CSV header row
                if (lineCounter == 0) {
                    lineCounter++;
                    continue;
                }
                lineCounter++;
                try {
                    String[] columns = line.split(",");
                    String geonameId = columns[1];
                    String lat = columns[7];
                    String lng = columns[8];
                    if ((lat == null || lat.equals("")) || (lng == null || lng.equals(""))
                            || (geonameId == null || geonameId.equals(""))) {
                        // proxy and satellite records carry no coordinates in the source file
                        System.out.println("Found proxy or satellite");
                    } else {
                        String key = geonameId.concat("|").concat(lng).concat("|").concat(lat);
                        if (!uniqueKeyContainer.contains(key)) {
                            uniqueRecordCounter++;
                            uniqueKeyContainer.add(key);
                            double latitude = Double.parseDouble(lat);
                            double longitude = Double.parseDouble(lng);
                            String locationId = geonameId;
                            GeoPoint geoPoint = new GeoPoint(latitude, longitude);
                            // the geohash is a large number that uniquely represents the coordinate pair...
                            long geoHash = S2Manager.generateGeohash(geoPoint);
                            // ...and the hash key is derived from its leading digits (6 here) for the table's hash attribute
                            long hashKey = S2Manager.generateHashKey(geoHash, 6);
                            // build the geoJson attribute by hand; note the [latitude,longitude] order,
                            // matching the order we used for the GeoPoint above
                            StringBuilder geoJsonBuilder = new StringBuilder();
                            geoJsonBuilder.append("{\"s\":\"{\\\"type\\\":\\\"Point\\\",\\\"coordinates\\\":[");
                            geoJsonBuilder.append(latitude).append(",").append(longitude).append("]}\"}");
                            // assemble one import record: name ETX value, attributes separated by STX
                            StringBuilder rowBuilder = new StringBuilder();
                            rowBuilder.append("rangeKey").append(endOfTextCharacter)
                                    .append("{\"s\":\"").append(uniqueRecordCounter).append("\"}")
                                    .append(startOfTextCharacter).append("geoJson").append(endOfTextCharacter)
                                    .append(geoJsonBuilder.toString())
                                    .append(startOfTextCharacter).append("hashKey").append(endOfTextCharacter)
                                    .append("{\"n\":\"").append(hashKey).append("\"}")
                                    .append(startOfTextCharacter).append("geoname_id").append(endOfTextCharacter)
                                    .append("{\"n\":\"").append(locationId).append("\"}")
                                    .append(startOfTextCharacter).append("geohash").append(endOfTextCharacter)
                                    .append("{\"n\":\"").append(geoHash).append("\"}")
                                    .append(System.lineSeparator());
                            out.print(rowBuilder.toString());
                        }
                    }
                } catch (Exception ex) {
                    System.out.println(ex.getMessage());
                }
            }
        } catch (IOException ex) {
            System.out.println(ex.getMessage());
        }
    } catch (IOException ex) {
        System.out.println(ex.getMessage());
    }
    System.out.println("File creation done. Number of unique records: "
            .concat(Integer.toString(uniqueRecordCounter)));
}
It’s probably best to run that code in Debug mode and see exactly what happens. If everything goes well you should end up with a .json file which, when opened in Notepad++, shows one delimited record per line.
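For illustration, a single record has the following shape. [ETX] and [STX] stand for the 0x03 and 0x02 control characters – Notepad++ renders them as black ETX and STX symbols – and the geohash and hash key values are placeholders, since the real ones come out of the S2 calculation:

rangeKey[ETX]{"s":"1"}[STX]geoJson[ETX]{"s":"{\"type\":\"Point\",\"coordinates\":[-27.0,133.0]}"}[STX]hashKey[ETX]{"n":"<6-digit hash key>"}[STX]geoname_id[ETX]{"n":"2077456"}[STX]geohash[ETX]{"n":"<full geohash>"}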
Note that the unique data record counter stopped at 42 for the reduced IPv4 range data source I selected above, so out of the first 275 rows in the current MaxMind CSV file 233 were filtered out as duplicates. You can expect the lng/lat range table to be much, much smaller than the IPv4 range table once all duplicates have been removed. In our real-life case we have just over 134k records in our lng/lat range table, compared to slightly above 10 million rows in the full IPv4 range table.
Upload the file to a folder in S3 and make sure that this input file is the only object in that folder. You can also create another, empty folder called “logs” at this point, where the DynamoDb import process will send its log messages.
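If you’d rather script the upload than click through the S3 console, here’s a minimal sketch using the AWS Java SDK that’s already among our Maven dependencies. The bucket name and folder key are placeholders I made up; substitute your own:

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;
import java.io.File;

public class GeoSourceFileUploader {
    public static void main(String[] args) {
        // picks up the credentials from the default profile on your machine
        AmazonS3Client s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
        File importFile = new File("C:\\path-to-reduced-maxmind-datasource\\GeoIP2-City-Blocks-LngLat-S3-ImportSourceFile.json");
        // "my-geo-bucket" and "lnglat-import" are placeholders; the folder must contain only this file
        s3Client.putObject("my-geo-bucket", "lnglat-import/" + importFile.getName(), importFile);
        System.out.println("Upload complete");
    }
}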
We’ll see how to upload these records into DynamoDb in the next post.
View all posts related to Amazon Web Services and Big Data here.