PyData Eindhoven 2022
All organizations will need to become data-driven organizations, or they will go the way of the dinosaur. However, AI scales risk to organizational brand and profit. Trustworthy and Ethical AI are no longer luxuries, but business necessities. Let's explore together why bias is not exclusive to AI, why technology has never been neutral, and why Data Science has little to do with Science!
Processing tabular data has been one of the most common operations for data scientists and engineers for a while now. A few years ago, pandas was the single tool of reference for it, but is that still true today?
In this talk, we will review and compare the existing dataframe frameworks to see how they solve the challenges of performance, scalability and user experience.
In the chip industry, time is money. Customers of ASML’s lithography systems expect high uptimes. But expected and unexpected maintenance is part of that equation, sometimes requiring production to be halted temporarily.
In this presentation, we show you how we are building and deploying Machine Learning models to predict maintenance actions within the upcoming three months. Our work helps to boost productivity, maximize system utilization and reduce unexpected workload for ASML’s customer support.
With targeted ads becoming more prevalent in the digital landscape, we share how we used Thompson sampling and a hierarchical Bayesian algorithm that makes its own decisions and serves the right ad to the right audience.
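To give a feel for the core idea, here is a minimal Beta-Bernoulli Thompson sampling loop for choosing between ads. It is a hedged sketch only: the click-through rates, number of rounds, and binary click feedback are made up, and the hierarchical model discussed in the talk is more involved.

```python
import numpy as np

rng = np.random.default_rng(42)
true_ctr = [0.02, 0.05, 0.03]       # hypothetical click-through rate per ad
alpha = np.ones(len(true_ctr))      # Beta posterior: 1 + observed clicks
beta = np.ones(len(true_ctr))       # Beta posterior: 1 + observed non-clicks

for _ in range(10_000):
    # Sample a plausible CTR for each ad from its posterior and show the best one
    sampled = rng.beta(alpha, beta)
    ad = int(np.argmax(sampled))
    click = rng.random() < true_ctr[ad]  # simulated user feedback
    alpha[ad] += click
    beta[ad] += 1 - click

print("posterior mean CTR per ad:", alpha / (alpha + beta))
```

Over time the loop serves the better-performing ad more often while still exploring the alternatives.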
We present FuzzyTM, a Python library for training fuzzy topic models and creating topic embeddings for downstream tasks. Its modular design allows researchers to modify each software element and leaves room for future methods to be added. Meanwhile, the user-friendly pipelines with default values allow practitioners to train a topic model with minimal effort.
In this talk I hope to convince you that models are not either predictive or causal: both perspectives should be combined to solve real-world problems. I will use a concrete example of how we automate irrigation in greenhouses at Source.
Inefficiencies in the flight preparation processes (turnaround) are responsible for around 30% of the total delays at Royal Schiphol Group (the Amsterdam airport). This process has long been a black box, which made it quite hard to improve. To open the turnaround black box, Schiphol has developed computer-vision technology based on deep learning that detects many different turnaround-related tasks in real time from images streamed by cameras located at the aircraft ramps. In this session, we will explain how this project started, the technologies that we have applied, and the business impact generated by enabling the airport to reduce delays.
Data is everywhere. It is through analysis and visualization that we are able to turn data into information that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story. Learn how elements of storytelling can be applied to data visualization.
Building and fine-tuning models is exciting, but how do you know your model keeps performing in the way you carefully designed it? Bringing your model to production without adding any monitoring is like flying on autopilot, but blindfolded.
Adding a mature monitoring setup to your model deployments can be a daunting task that is often pushed to the bottom of the to-do list, or put off entirely. How can we, Data Scientists and ML Engineers, introduce monitoring earlier in the MLOps process and make it part of our deployments right from the start? This talk offers a practical setup for implementing ML monitoring in your project using Prometheus and other open-source tools.
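As a rough sketch of what such a setup can look like with the open-source prometheus_client library (the metric names, port, and stand-in model below are illustrative, not the talk's exact setup):

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; align them with your own model service.
PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

def model_predict(features):
    """Stand-in for a real model call."""
    return sum(features) > 1.0

@LATENCY.time()
def predict(features):
    PREDICTIONS.inc()
    return model_predict(features)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus can now scrape http://localhost:8000/metrics
    while True:
        predict([0.4, 0.8])
        time.sleep(1)
```

Prometheus scrapes the exposed counters and histograms, and alerting or dashboarding on top of them is then a configuration exercise rather than extra model code.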
In the Python open-source eco-system, many packages are available that cater to:
- the building of great algorithms
- the visualization of data
- back-end functions
Despite this, over 85% of data science pilots remain pilots and do not make it to the production stage.
With Taipy, Data Scientists/Python Developers will be able to build great pilots as well as stunning production-ready applications for end-users.
Taipy provides two independent modules: Taipy GUI and Taipy Core.
In this talk, we will demonstrate how:
- Taipy GUI goes way beyond the capabilities of the standard graphical stack: Streamlit, Dash, etc.
- Taipy Core is simpler yet more powerful than the standard Python back-end stack: Airflow, MLFlow, etc.
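For Taipy GUI, a minimal page might look like the sketch below. This is a hedged example based on Taipy's documented Markdown-style page syntax; the bound variable and slider control are illustrative, not part of the talk's demo.

```python
from taipy.gui import Gui

value = 10

# Taipy pages are written in an augmented Markdown syntax:
# <|{value}|> binds the text to the Python variable, and the
# slider element keeps that variable in sync with the UI state.
page = """
# Hello Taipy

Value: <|{value}|>

<|{value}|slider|min=0|max=100|>
"""

Gui(page=page).run()
```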
Using data in new and unexpected ways to solve real problems for real people - from farmers in Africa to refugees and the war in Ukraine
Let's say you've got some unlabelled data and you want to train a classifier. You need annotations before you can model, but because you're time-bound you must stay pragmatic. You only have an afternoon to spend. What would you do?
Devcontainers are an open-source specification that allows you to connect your IDE to a running Docker container and develop right inside it. This has numerous advantages. Because the dev environment is now formally defined, it is reproducible, which means others can easily recreate your dev environment too. This makes it much easier for others to join your project and stay up to date with changes to the environment.
In this talk, you will learn: why you might want to use a Devcontainer for your project (or when not 😉), what exactly a Devcontainer is, and how you can build one for your Python project 🐍.
The most popular data science development tools have largely been developed by academics as scratch pads for interactive data exploration. Jupyter notebooks, for instance, were developed 20 years ago at Berkeley (they were called IPython notebooks at the time). Because of their flexibility and interactivity, these tools have become widespread amongst coding data scientists. More recently, GUI-based tools have begun to be popular. They reduce the technical load on the user, but typically lack much-needed flexibility and interoperability.

Both avenues of innovation are wildly inadequate for modern data science development. GUI-based tools are typically too expensive, too restrictive, and too closed. The development of automated machine learning tools has only made this problem worse, with dozens of software startups urging business analysts to start building machine learning solutions, often with questionable results and even more questionable customer retention metrics. Notebook-based solutions, on the other hand, are typically too error-prone, too loose, and too isolated to be sufficient. The result is intractable challenges around collaboration, communication, and deployment. The most recent entrants into the notebook space have only marginally improved the experience without fixing the underlying flaws.

This talk discusses the fundamental flaws in the way these tools have been developed and how they currently function. Advancement in this space will require reworking the architecture and functionality of these tools at the most basic levels. These fixes include multiprocessing capabilities; real-time collaboration tools; safe, consistent code execution; easy API deployment; and portable communication tools. Future innovation in the data science development experience will have to tackle these problems and more in order to be successful.
Having other people use your datasets is nice, but updating the logic behind them can break dashboards and ML models down the line. In this talk I will explain how to prevent these stressful situations by applying unit testing to your data or preprocessing pipelines in Python.
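A minimal sketch of what this can look like with pytest and pandas; the preprocessing function and column names are hypothetical examples, not from the talk:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def add_order_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing step: compute a line total per order row."""
    out = df.copy()
    out["total"] = out["quantity"] * out["unit_price"]
    return out

def test_add_order_totals():
    # Small, hand-written input and expected output make regressions obvious
    raw = pd.DataFrame({"quantity": [2, 3], "unit_price": [1.5, 2.0]})
    expected = raw.assign(total=[3.0, 6.0])
    assert_frame_equal(add_order_totals(raw), expected)
```

Running such tests in CI means a change to the transformation logic fails loudly before it silently breaks a downstream dashboard or model.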
Code archaeology is figuring out what a thing is for, who built it, and how you can get it to run again.
Dealing with legacy code artefacts (while under time pressure) is something we data people encounter a lot in daily life. I will share my experiences from both a research and a software engineering standpoint. After quickly going over some common-sense approaches, I will dive deeper into real-world archaeology and digital forensics to find out what we can learn from these fields to make dealing with old artefacts a bit easier. Expect a mix of code and non-code hacks, with ample pop culture archaeology memes.
Cognitive impairment is common amongst patients with primary brain tumors (PBT). The exact mechanism by which primary brain tumors affect different cognitive functions, however, is not well understood. Cognitive impairment in PBT patients is likely the result of local effects of the tumor, global effects of the tumor, and patient characteristics. Finding predictors, or the potentially complex interactions between them, may improve our understanding of how different variables influence cognitive function. Moreover, this may facilitate personalized prediction of cognitive function and aid personalized treatment decisions. Several big challenges arise when aiming to make personalized predictions of cognitive functioning in PBT patients, and many of these problems likely generalize to other applied machine learning tasks.
In machine learning projects we need to experiment in order to find and maintain the best-performing model. While we can do initial prototyping in a Notebook, eventually we need to move towards more structured experiment tracking to facilitate reproducibility of our experiments.
The open-source DVC library aims to tackle this problem through a Git-based approach to versioning data and artifacts. In this talk we will explore how DVC works, how we can apply it to conduct ML experiments, and how we can use it to become a great Pokémon trainer.
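For a flavour of the versioning side, DVC also exposes a Python API for reading a tracked artifact at a specific revision. This is a hedged sketch: the file path and revision name are hypothetical and would live in your own DVC-enabled Git repository.

```python
import dvc.api

# Hypothetical path and Git revision in a DVC-tracked repository
with dvc.api.open("data/pokemon.csv", rev="experiment-baseline") as f:
    header = f.readline()
    print(header)
```

Because the revision pins both the code and the data version, rerunning an experiment later retrieves exactly the inputs it was originally trained on.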
When A/B testing is not possible but we are still interested in drawing causal conclusions from our data, we need to resort to quasi-experimental approaches. This is the landscape that Just Eat Takeaway.com navigates: we often have experimental data about a specific city and are interested in knowing what the effects would be in another city. When we drop the requirement of causality and are merely interested in generating likely scenarios, we can use the power of predictive modelling to our advantage. From predicting likely future scenarios to generating synthetic order data on a minute-by-minute basis, all is possible using the right statistical tools. Even in the absence of pure experimental data, we are still able to model likely futures. This talk is relevant for data scientists who are interested in the intersection of statistics and predictive modelling; some basic knowledge of these topics will be assumed. The first half of the presentation (minutes 0-15) will cover quasi-experimental models, and the second half (minutes 15-30) will cover scenario and data generation.
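As one concrete, hedged illustration of a quasi-experimental technique (not necessarily the exact approach used at Just Eat Takeaway.com), a difference-in-differences estimate can be obtained with statsmodels; the data below is fully synthetic:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic example: city A receives an intervention, city B does not.
df = pd.DataFrame({
    "orders":  [100, 104, 130, 98, 101, 103],
    "treated": [1, 1, 1, 0, 0, 0],   # 1 = city A, 0 = city B
    "post":    [0, 0, 1, 0, 0, 1],   # 1 = observation after the intervention
})

# The coefficient on the treated:post interaction is the
# difference-in-differences estimate of the intervention effect.
model = smf.ols("orders ~ treated * post", data=df).fit()
print(model.params["treated:post"])
```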
This talk is about machine learning package development. I will speak about the pains and benefits it brings for developers and share why open-sourcing makes the package even better. The talk is not focused on the package itself but rather on common problems, so it will be interesting for a wide range of data scientists and Python developers.
PyData provides a forum for an international community of users and developers to share ideas and learn from each other. So let’s connect, come to this interactive session to meet people from other cultures, with new questions and fresh perspectives.
Sharing knowledge and experience with others is not only rewarding but also actually improves your professional skills. CodeMasters offers skills of the future to those who need them most. During a 10-week program, participants are supported in learning how to code and growing towards a career in the Netherlands.
In this talk, we will present DuckDB. DuckDB is a novel data management system that executes analytical SQL queries without requiring a server. DuckDB has a unique, in-depth integration with the existing PyData ecosystem. This integration allows DuckDB to query and output data from and to other Python libraries without copying it. This makes DuckDB an essential tool for the data scientist. In a live demo, we will showcase how DuckDB performs and integrates with the most used Python data-wrangling tool, pandas.
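As a small illustration of that integration (the DataFrame contents are made up), DuckDB can query a local pandas DataFrame by its variable name and hand the result back as a DataFrame:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"city": ["Eindhoven", "Amsterdam", "Eindhoven"],
                   "sales": [10, 20, 5]})

# DuckDB resolves `df` to the local pandas DataFrame and queries it in place
result = duckdb.query("SELECT city, SUM(sales) AS total FROM df GROUP BY city").to_df()
print(result)
```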
An ever-increasing number of people are discovering mobile grocery shopping as an alternative to brick-and-mortar supermarkets. This talk will cover how we can use machine learning to make these customers' grocery shopping as smooth and frictionless as possible. We do this by applying ML models that rank products in agreement with the customer’s intent: e.g., by detecting personal shopping habits, and by striking a balance between query relevance and margin.