Calculating commute distance using SSIS and the GoogleMaps API

Home - Azure - Calculating commute distance using SSIS and the GoogleMaps API

Last week, I had an interesting question from one of my clients about estimates of planned commute distances. For example: What is the distance I travel when I, living at Kalverstraat 7 (Amsterdam) commute five days a week to my customer in Paris (1, rue Victor Cousin)? By the way: it should be calculated for all employees, for all customer locations, so manually entering distances is not considered an option.

Two options came to my mind:

  1. Translate locations to geo-coordinates, then calculating distance as the crow flies
  2. Use a cloud-service to give a somewhat more accurate estimate

Back in 2005 – when cloud services weren’t that widespread – I’d consider the first option. But look at the calendar – it’s 2016! Cloud, baby!

Azure DataMarket

So, as a Microsoft-oriented BI guy, my first guess to look at was Azure Datamarket (now included in Azure Marketplace). Integrated with DQS, it would be an easy game to calculate travel distances, right?

Well, no. A few quick searches showed that there are some services available offering travel distances, but they’re pretty limited – requiring coördinates as input (of course, customers always stores addresses as coördinates) or being limited to Great Britain[ref]Great Britain: once a country consisting of Scotland, Northern Ireland and England. After leaving the EU they lost Scotland, Northern Ireland and a huge amount of their GDP. Called “Less Britain” ever since /sarcasm [/ref]. There’s also the Bing Maps API, but I didn’t find anything there to just calculate travel distances (I thinkgeocoding API’s are the closest, maybe combined with the routes API. Still, neither gives me just the distance). Bollocks.

Google DistanceMatrix

But more cloud-companies exist out there. Google, for exapmle, has a pretty solid web service called “DistanceMatrix” calculating distances between geographical locations – anywhere on the world, in any input format. Eat that, Azure Data Market! Of course with Google there’s no WSDL file, but the service is really simple to use (check out the docs). All you need to do:

  1. Write some test queries to discover capabilities
  2. Sign up for an API key
  3. Design your ETL-process

Designing an ETL process using an external API

Think well about this third step: you can save quite a few dollars annually by just keeping a local cache of travel distances, and only querying when the distance is unknown. My ETL process for this part consists of three global steps:

  1. Add unknown departure/destination pairs to the ‘to be queried’ table (PK of this table is start & end point address in less-structured format, ensuring uniqueness of commutes)
  2. Query Maps API for unknown travel distances. Add retrieved distances (or known unknowns) to the local cache table of travel distances
  3. Use the local cache of travel distance (as complete as it gets at this moment) as the primary lookup for travel distance

Currently, I’ve chosen to have the cache process explicitly modeled in my ETL structure. Another option would be to make a .Net UDF keeping a local (or even table) cache. This would simplify the looks of the process design by putting everything inside an opaque object. Keep in mind what happens if the ETL process breaks though: can your DWH/ETL administrators still troubleshoot?

To query the maps API, I use a Transformation Script Component inside SSIS consisting of the following code:


As you see, it’s pretty easy to query the Google Maps API (you could make it even easier by creating a Custom Component in SSIS). On top of that, it makes one think: what if data integration would be always this easy? Think about it: what if you could just say: “Here are some characteristics of this customer. I don’t know if it’s 100% accurate and if all data is in the correct fields, but please figure out who I mean and show me where he is in my sources…”.

That, my friend, would be a true Integration Service.

Latest Posts