PyData Eindhoven 2022

How to not pull your hair out while providing data to the business: unit testing for your data pipelines
12-02, 15:10–15:40 (Europe/Amsterdam), Ernst-Curie

It is nice when other people use your datasets, but updating the logic behind them can break dashboards and ML models down the line. In this talk I will explain how to prevent these stressful situations by applying unit testing to your data or preprocessing pipelines in Python.


Unit testing is a testing method that is widely used in software development. It helps in building more robust applications, so as a developer you can release with more confidence and fewer errors in production. But can you use this technique as a data engineer or data scientist for your preprocessing or pipeline code as well?

It turns out you can! And you can reap the same benefits that software engineers experience, such as extra confidence before deployment, in your day-to-day work!

In this talk I will walk you through the conceptual idea of unit testing for data or preprocessing pipelines and provide an example of how to apply it to a very simple use case that uses Pandas. The example will test some transformations on beer data 😉.

I will walk through a five-step process with code examples in Python, giving data scientists and data engineers practical guidance on how to apply it to their own projects. The steps show how to:
- Define the logic that you want to test within your project;
- Refactor your code so that the specific parts you want to test are separated;
- Generate synthetic data for testing purposes;
- Put your refactored code under tests using Pytest;
- Set up a CI pipeline in your Git provider in order to test code automatically.
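
As a rough illustration of the refactoring, synthetic data, and testing steps above, here is a minimal sketch. It is not the code from the talk: the function `add_alcohol_units`, the helper `make_synthetic_beers`, and the column names `volume_ml`, `abv_percent` and `alcohol_ml` are all hypothetical, chosen only to show the shape of a testable Pandas transformation. The test functions use plain `assert` statements and `pandas.testing.assert_frame_equal`, so pytest can collect them without any extra setup.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_alcohol_units(beers: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `beers` with an `alcohol_ml` column added.

    Hypothetical transformation: millilitres of pure alcohol per serving.
    Keeping it in its own function is what makes it testable in isolation.
    """
    result = beers.copy()
    result["alcohol_ml"] = result["volume_ml"] * result["abv_percent"] / 100
    return result


def make_synthetic_beers() -> pd.DataFrame:
    """Build a tiny hand-written dataset so expected values are easy to verify."""
    return pd.DataFrame(
        {
            "name": ["Pilsner", "Tripel", "Alcohol-free"],
            "volume_ml": [300.0, 330.0, 250.0],
            "abv_percent": [5.0, 9.0, 0.0],
        }
    )


def test_add_alcohol_units_computes_expected_values():
    # Expected output is written out by hand, not derived from the code under test.
    expected = make_synthetic_beers()
    expected["alcohol_ml"] = [15.0, 29.7, 0.0]

    actual = add_alcohol_units(make_synthetic_beers())

    assert_frame_equal(actual, expected)


def test_add_alcohol_units_does_not_modify_input():
    beers = make_synthetic_beers()

    add_alcohol_units(beers)

    # The transformation works on a copy, so the input keeps its original columns.
    assert "alcohol_ml" not in beers.columns
```

Saved in a file such as `test_beers.py` (again, a hypothetical name), these tests can be run locally with the `pytest` command, and the same command is what a CI pipeline would typically execute on every push.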


Prior Knowledge Expected

No previous knowledge expected

Data Engineer and co-founder of Blenddata.

Helping clients to build data platforms in the cloud.

Enthusiastic about clean code, cloud, self-service solutions and music.