Using Amazon DynamoDb for IP and co-ordinate based geo-location services part 6: uploading IPv4 range to DynamoDb

Introduction

In the previous post we successfully created a limited IPv4 range file ready to be uploaded to DynamoDb. We saw how the relevant bits were extracted from the reduced subset of the MaxMind CSV source file and how the DynamoDb-specific input file was created.

In this post we’ll see how to upload the source file to DynamoDb using the bulk insertion tools available there. We’ll only import our limited test data but the same steps apply for large data sets as well.

Table setup in DynamoDb

Log onto the AWS web console and navigate to DynamoDb. Create a new table with the following characteristics:

  • Name: geo-ip-range-test
  • Primary key type: hash and range
  • Hash attribute name: “network_head” of type Number
  • Range attribute name: “network_start_integer” of type Number
  • (Click Continue to come to the indexes page)
  • Select Index type “Local secondary index”: this will automatically add “network_head (Number)” as the index hash key
  • As index range key insert “network_last_integer” of type Number
  • The index name will be autocompleted to network_last_integer-index, that’s good enough
  • Make sure that “All attributes” is selected in the Project attributes drop-down list
  • Click “Add index to table”
  • (Click continue)
  • Set both the read and write capacity units to 5. This is fine for our small demo data set but far too low for a 10-million-row table. When you’re ready to import the full data set of approx. 10 million rows the write throughput will need to increase a lot, otherwise the insertion process will take a very long time. You can raise the write throughput to something large just for the import phase, e.g. 10000, and then reduce it back to 5 or even lower, as writes will rarely occur afterwards, if at all. Similarly, the read throughput will need to increase a lot for the real database, otherwise the queries will easily run into exceptions: with a large data set and a low read throughput a query may need to scan too many records and quickly exceed the provisioned read throughput limit. We’ve set the read throughput to 1000 for our live geo IP range table
  • (Click continue)
  • You can choose to set up a basic alarm. This is not vital for this demo exercise but is very useful for production databases. I’ll let you decide whether you want to be notified in case of a throughput limit breach
  • Click continue to reach the Review pane and click Create
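To get a feel for why the write throughput matters for the full import, here is a rough back-of-the-envelope sketch. It assumes that every row fits within 1 KB, so that one write consumes one write capacity unit, and that the import keeps the provisioned capacity fully busy. The function name is made up for this illustration and the result is a lower bound only, since it ignores the pipeline start-up time.

```python
def estimated_import_seconds(row_count: int, write_capacity_units: int,
                             units_per_write: int = 1) -> float:
    """Lower bound on import time when writes saturate the provisioned capacity.

    Assumption: each row is at most 1 KB, i.e. one write costs one unit.
    """
    if write_capacity_units <= 0:
        raise ValueError("write_capacity_units must be positive")
    return row_count * units_per_write / write_capacity_units


if __name__ == "__main__":
    rows = 10_000_000  # approximate size of the full data set
    for wcu in (5, 10_000):
        seconds = estimated_import_seconds(rows, wcu)
        print(f"{wcu:>6} write units: about {seconds / 3600:.1f} hours")
```

Under these assumptions 5 write units would need roughly 2,000,000 seconds, i.e. about 23 days, whereas 10000 write units bring it down to about 1000 seconds, or some 17 minutes.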

Wait for the table to reach status ACTIVE.
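If you prefer code to clicking through the console, the table setup above could be sketched roughly as follows with Python and boto3. This is not how the table was created for this series, only an approximate equivalent of the console steps. The helper names are made up for this example, and the function that talks to AWS assumes that boto3 is installed and that credentials and a region are already configured.

```python
TABLE_NAME = "geo-ip-range-test"


def build_table_definition(read_units: int = 5, write_units: int = 5) -> dict:
    """Return keyword arguments for the DynamoDB CreateTable operation."""
    return {
        "TableName": TABLE_NAME,
        # Only attributes used in a key schema are declared up front.
        "AttributeDefinitions": [
            {"AttributeName": "network_head", "AttributeType": "N"},
            {"AttributeName": "network_start_integer", "AttributeType": "N"},
            {"AttributeName": "network_last_integer", "AttributeType": "N"},
        ],
        "KeySchema": [
            {"AttributeName": "network_head", "KeyType": "HASH"},
            {"AttributeName": "network_start_integer", "KeyType": "RANGE"},
        ],
        # A local secondary index shares the hash key of the table.
        "LocalSecondaryIndexes": [
            {
                "IndexName": "network_last_integer-index",
                "KeySchema": [
                    {"AttributeName": "network_head", "KeyType": "HASH"},
                    {"AttributeName": "network_last_integer", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def create_table_and_wait(client=None) -> None:
    """Create the table and block until it is ACTIVE (needs AWS credentials)."""
    if client is None:
        import boto3  # imported lazily so the definition can be inspected offline
        client = boto3.client("dynamodb")
    client.create_table(**build_table_definition())
    # The "table_exists" waiter polls DescribeTable until the status is ACTIVE.
    client.get_waiter("table_exists").wait(TableName=TABLE_NAME)
```

Calling create_table_and_wait() returns once the table is ready. build_table_definition() can be printed or inspected without any AWS access.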

Importing the records

Click the “Export/Import” button in the menu:

Export-Import button in DynamoDb menu

Select the ip range table and click to import into DynamoDb:

Import into DynamoDb table button

You’ll be directed to the “Create Import Table Data Pipeline” view. Select the S3 input folder that holds the DynamoDb input JSON file we created previously. Provide an S3 log folder as well. These logs can provide important information in case the import process fails. You can set the throughput rate to 100% as no other process will need access to the IP range table during this time. Leave the execution timeout at 24 hours. You can also provide an email address to receive notifications.

Make sure that the selected pipeline role has access to DynamoDb and S3. Click “Create import pipeline” and you’ll see the pipeline listed in the Import Pipelines column:

Import pipeline listed in DynamoDb

Refresh the screen using the green arrow icon in the top right-hand corner. You should see that the EMR activity is “waiting for a runner”:

Import export EMR activity waiting for runner

Be patient, it can take 15-20 minutes for the process to complete. We probably could have inserted the demo records one by one using code but I think this is a good exercise for the case of importing the full data set later.

If you have access to Amazon Data Pipeline then you can follow the process there directly:

Follow import process in Data pipeline

After a while the process should be RUNNING:

Import job running in Data pipeline

The same view is available in DynamoDb:

Import activity running in DynamoDb

Refresh the status periodically until the job finishes:

Import job finished in data pipeline

The same view looks like this in DynamoDb:

Import job finished in DynamoDb

It’s a good sign that the pipeline state and EmrActivity status are both marked as FINISHED in green.

Go back to the tables list in DynamoDb and check the contents of geo-ip-range-test:

IP range successfully imported into DynamoDb
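Besides eyeballing the rows in the console you could also check the number of imported items from code. The following is a minimal sketch, assuming a boto3 DynamoDb client is passed in. The function name count_items is made up for this example. Note that a Scan reads the whole table and consumes read capacity, so this is only sensible for a small test table like ours.

```python
def count_items(client, table_name: str = "geo-ip-range-test") -> int:
    """Count the items of a table with paginated Scan requests.

    Select="COUNT" makes DynamoDb return only counts, not the items.
    """
    total = 0
    kwargs = {"TableName": table_name, "Select": "COUNT"}
    while True:
        page = client.scan(**kwargs)
        total += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # No more pages: the whole table has been scanned.
            return total
        kwargs["ExclusiveStartKey"] = last_key
```

With credentials in place the call would be count_items(boto3.client("dynamodb")). The result should match the number of rows in the input file.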

In the next post we’ll see how to query this table using the Java AWS SDK to find the geoname ID of a single IP address.

View all posts related to Amazon Web Services and Big Data here.

About Andras Nemes
I'm a .NET/Java developer living and working in Stockholm, Sweden.

3 Responses to Using Amazon DynamoDb for IP and co-ordinate based geo-location services part 6: uploading IPv4 range to DynamoDb

  1. Glen says:

    Why did you create the local secondary index? The query + filter you use later on doesn’t utilise it.

    Unless I’m doing something wrong, the query can be quite inefficient and depending on the IP address you could be doing several pages of empty queries before finding a result.

    • Andras Nemes says:

      To be honest I don’t remember anymore why the secondary index was created. We abandoned this project soon after I finished the documentation series. You might want to redo the process without adding the secondary index and see if it still works fine. //Andras

      • Garrows says:

        Turns out I was doing something wrong. I wasn’t sorting in descending order which was making it inefficient. The secondary index does nothing though and I’d recommend removing it from the article.
