PyData Eindhoven 2022

Becoming a Pokémon Master with DVC: reproducible machine learning experiments
12-02, 15:50–16:20 (Europe/Amsterdam), Auditorium

In machine learning projects we need to experiment in order to find and maintain the best-performing model. While we can do initial prototyping in a Notebook, eventually we need to move towards more structured experiment tracking to facilitate reproducibility of our experiments.

The open-source DVC library aims to tackle this problem through a Git-based approach to versioning data and artifacts. In this talk we will explore how DVC works, how we can apply it to conduct ML experiments, and how we can use it to become a great Pokémon trainer.

Every data scientist has at one point kept track of their experiments on paper, sticky notes, or in a spreadsheet. But how can we guarantee reproducibility for potentially thousands of experiments over numerous years? Can we figure out which version of a model ran in production six months ago, and what data went into its training?

The talk is aimed at data scientists and explores best practices for ML projects using a light-hearted topic. Some general knowledge of how ML works is helpful, but not necessary to understand the talk. The key concept is reproducibility: how can we track and version not just code, but entire experiments?

DVC is a potential solution for this. The philosophy behind it can be summarized as "Git for data and models". I will discuss its concepts and show how it works in practice for a classifier of Pokémon sprites.
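To make the "Git for data and models" idea concrete, here is a minimal sketch in plain Python. It is not DVC's implementation or file format. It only illustrates the general principle of tracking a large file through a small, Git-friendly pointer that records a hash of its content. The names `content_hash`, `write_pointer`, `demo`, the `.pointer.json` suffix, and the `sprites.csv` file are all hypothetical and chosen for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small pointer file next to the data file.

    The pointer (a hypothetical format, not DVC's) is small enough to
    commit to Git, while the data itself can live elsewhere.
    """
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer = {
        "path": data_path.name,
        "sha256": content_hash(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def demo() -> tuple:
    """Hash a toy dataset before and after a change and return both hashes."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "sprites.csv"  # hypothetical dataset
        data.write_text("name,type\npikachu,electric\n")
        before = json.loads(write_pointer(data).read_text())["sha256"]

        # Changing the data changes the hash, so the pointer file changes
        # and Git would record a new version of the dataset.
        data.write_text("name,type\npikachu,electric\nbulbasaur,grass\n")
        after = json.loads(write_pointer(data).read_text())["sha256"]
        return before, after


if __name__ == "__main__":
    before, after = demo()
    print("dataset version changed:", before != after)
```

Because the pointer changes whenever the data changes, checking out an old commit tells us exactly which version of the data belonged to that state of the code, which is the kind of question raised above about a model that ran in production six months ago.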

The main takeaways will be the importance of reproducibility and, through a demo, a practical way to achieve it.

Prior Knowledge Expected

No previous knowledge expected

Rob is a developer advocate at Iterative AI. He’s got a background in information sciences, and experience in data analytics and engineering. Right now he’s learning a whole lot about MLOps and exploring how people can adopt a collaborative, experiment-driven approach to ML projects.