PyData Eindhoven 2022

A Tour of the Many DataFrame Frameworks
12-02, 09:50–10:20 (Europe/Amsterdam), Auditorium

Processing tabular data has been one of the most common operations for data scientists and engineers for a while now. A few years ago, pandas was the single tool of reference for it, but is that still true today?
In this talk, we will review and compare the existing dataframe frameworks to see how they solve the challenges of performance, scalability and user experience.


Growing dataset sizes and increasingly diverse use cases have highlighted many challenges regarding performance, scalability and user experience. The ecosystem has evolved to include many new alternatives, each tackling one or more of those dimensions differently. Some of them even put SQL back under the spotlight!

In this talk, we will dive deep into the internals of tabular data processing and look at how the main players of the ecosystem work under the hood. After defining the fundamentals, we will zoom in on their APIs and memory models through various examples, so that the audience gets an illustrated comparison of the frameworks.


Prior Knowledge Expected

Previous knowledge expected

Harizo is the VP of Tech Content & Enablement at Dataiku, a company that offers a platform to build, deploy and run data science and machine learning projects at scale. He leads the Developer Advocacy team, whose mission is to facilitate the adoption of the Dataiku platform by its most technical users.

Harizo started at Dataiku as a Data Scientist before moving to the Engineering team. Prior to that, he completed a PhD in mathematics on probabilistic simulations at scale for atmospheric physics.