How we create cleaned, reproducable data for use in projects and apps

At the City of Amsterdam we deal with many different types of structured and unstructered data. Much of the data is not of high quality and are missing a lot of needed semantics to do proper analytics with.

This guide describes our workflow for creating a reproducable ETL data flow which we use to create better data for analytics and usage in dashboards and maps.

If you want to skip the explanation and want to go directly to our github data-processing repo:

First things first

We use these principles:

edit xml with draw.io.

ETL workflow

The current roles within the City of Amsterdam are described here with which people and teams we collaborate with to get our data:

The workflow consist of 4 steps:

  1. Getting the data
  2. Stage the data
  3. Clean the data
  4. Combine the data

1. Getting the data

Much of the data is still recieved by mail in the form of database dumps as tabular or spatial file. That creates a dependancy of the availability of person on the sending and recieving end regarding adding new data to the pipeline.

To solve this an make automation possible, clients are asked to:

A. API/web route

  1. Create an API token to access the api of the registration system and send the token by mail.
  2. Send an API token by mail.
  3. Send PHP login credentials by mail to use the API’s.
  4. If there is no api available: Ask the owner of the data to create an api on the registration system and help with the api output definitions.

B. Database route

  1. Recieve an user account for a datamart connection and ask for a ip connection through Motiv if it is not yet available @datapunt.

C. File delivery route

  1. Let the client upload a database dump periodically to the objectstore with a user/password.
  2. Send the (database) dump by mail to the data analist.
  3. Copy the dump on to a protected/encrypted USB stick on location.

Optionally: Send the ip ranges of our organisation to the application maintainers.

Responsibility of the data analist/engineer:

  1. Store the token/login in the password manager
  2. Store the userlogin/password in the password manager of the project or team
  3. Store the dump on the objectstore

2. Staging the data

At the city of Amsterdam we work a lot with spatial data. That’s why we use Postgres with the Postgis extention often to combine, enrich and clean spatial data.

To load the data into the database we use python in a Docker container to create a resuable and production viable import for every project.

A typical flow will look like this:

  1. Setup the credentials as environment variables.
  2. Login to objectstore or api.
  3. Download all records by looping through the API objects, downloading from the objectstore or loading them from a datamart.
  4. Save the files/objects in dataframe or dict and do some preliminary data cleanup.
  5. Set the schema if needed.
  6. Store the dataframe/dict in the postgres database in a schema with the name of the project.
  7. Add additional tables for semantic enrichment, like neighborhood areas or addresses.
  8. Create a view for use for analyis or further usage in Service backends like the Django Rest Framwork or programs like Tableau Desktop.

3. Cleaning the data

Most of the cleaning we do beforehand in pure python or using pandas. It the data is large we use mostly sql scripts in a postgres database and clean it with use of additional datasets like:

  1. Basisregistraties Which are all available as a daily postgres backup or as an API or WFS services.
  2. pdok.nl The Dutch geo data portal, which also has a nice geocoding api
  3. NL_Extract A Dutch initiative which publishes monthly many free available dumps of all Dutch addresses as a postgres dump, but also OSM, BRT and BRK and Top10NL.

To connect to these services more easily we made some python modules, which are explained here: https://amsterdam.github.io/data-processing/modules.html

4. Combining the data

After the data is cleaned, we combine the data mostly with postgres using Docker or with pandas dataframes. It depends on the output if it is served back as an API in Django for web services or as a flat CSV file to reuse in other visual programs like Tableau.

All Guides

Contributing

How to contribute to this City of Amsterdam Open Source project

Python backend projects

The style guide to the way we organize our Python back-end projects

How to code for humans

What we should think of when writing code so the most important computer we work with—the human brain—can parse it effectively

How we create cleaned, reproducable data for use in projects and apps

The way we make reusable data etl pipelines

How we create a docker environment for data analysis

How we set up a docker environment for analysis

How we code Python

The style guide to the way we code Python

How we track Open Source health

Understanding the impact and health of Open Source projects

How to write a README

The goto file to see what a project is and how to use it