0.7
cfp
PyData Eindhoven 2022
2022-12-02
2022-12-02
1
00:05
https://eindhoven2022.pydata.org/cfp/schedule/
Europe/Amsterdam
2022-12-02T09:00:00+01:00
09:00
00:30
Auditorium
cfp-77-ai-ethics-in-the-wild-welcome-to-the-jungle
https://eindhoven2022.pydata.org//cfp/talk/SMVQQ3/
false
AI Ethics in the Wild - Welcome to the Jungle
Talk
en
All organizations will need to become data-driven, or they will go the way of the dinosaur. However, AI scales risk to organizational brand and profit. Trustworthy and Ethical AI are no longer luxuries, but business necessities. Let's explore together why bias is not exclusive to AI, why technology has never been neutral, and why Data Science has little to do with Science!
Marc is AI & Ethics lead at KPMG, Managing Consultant, recovering Data Scientist and Public Speaker.
Marc van Meel
2022-12-02T09:50:00+01:00
09:50
00:30
Auditorium
cfp-15-a-tour-of-the-many-dataframe-frameworks
https://eindhoven2022.pydata.org//cfp/talk/FE9LM3/
false
A Tour of the Many DataFrame Frameworks
Talk
en
Processing tabular data has been one of the most common operations for data scientists and engineers for a while now. A few years ago, pandas was the single tool of reference for it, but is that still true today?
In this talk, we will review and compare the existing dataframe frameworks to see how they solve the challenges of performance, scalability and user experience.
Processing tabular data has been one of the most common operations for data scientists and engineers for a while now. A few years ago, pandas was the single tool of reference for it, but is that still true today?
The increase in the size of the datasets and in the diversity of the use-cases has highlighted many challenges regarding performance, scalability and user experience. The ecosystem has evolved to now include many new alternatives, each of them tackling one or more of those dimensions differently. Some of them even put SQL back under the spotlight!
In this talk we will deep dive into the internals of tabular data processing and look at how the main players of the ecosystem work under the hood. After defining the fundamentals, we will zoom on their APIs and memory models through various examples, so that the audience can get an illustrated comparison between frameworks.
Harizo Rajaona
2022-12-02T10:55:00+01:00
10:55
00:30
Auditorium
cfp-66-is-it-a-predictive-model-is-it-causal-inference-well-it-is-running-a-greenhouse-
https://eindhoven2022.pydata.org//cfp/talk/AQU7KB/
false
Is it a predictive model? Is it causal inference? Well... It is running a greenhouse.
Talk
en
In this talk I hope to convince you that models are not either predictive or causal, but both perspectives should be combined to solve real world problems. I will use a concrete example of how we automate irrigation in greenhouses at Source.
The causal revolution has taught us that there is a world beyond generating predictions. We now know that not all ML models are suitable for causal inference, and we have alternatives such as double machine learning.
During this talk, I hope to convince you that the difference between predictive and causal models is not as clear cut as you might think. Using a concrete example of how we control irrigation in greenhouses using machine learning, I will give an example of how to break down a problem into model components that are more or less predictive or causal. Moreover, I hope to give you some practical guidelines on how to decide whether a predictive or causal approach is more suitable for the components of your model.
Outline:
- Causality warm-up
- Explanation of irrigation in greenhouses
- Demonstration of the caveat with predictive models
- Demonstration of why feature selection matters more than framework selection
- Brief introduction to double machine learning (a full explanation is beyond the scope of this presentation)
- Demonstration of why double machine learning does not solve the feature selection problem
- Optional: link to Judea Pearl's causal graphs
- Explanation of how to isolate part of the problem where you can use predictive models
- Explanation of how these components come together in our solution for irrigation control at Source
- General advice on how to identify whether to use a predictive or causal approach
- Conclusion
/media/cfp/submissions/AQU7KB/IMG-20220401-WA0002_1_31GM7Or.jpg
Ruben Mak
2022-12-02T11:35:00+01:00
11:35
00:30
Auditorium
cfp-4-data-storytelling-through-visualization
https://eindhoven2022.pydata.org//cfp/talk/MHA9YZ/
false
Data Storytelling through Visualization
Talk
en
Data is everywhere. It is through analysis and visualization that we are able to turn data into *information* that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story. Learn how elements of storytelling can be applied to data visualization.
Data is everywhere. It is through analysis and visualization that we are able to turn data into *information* that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story.
This talk will show, through numerous examples, how elements of storytelling can be applied to data visualization to uncover the story hidden in your data.
Additionally, we'll question how objective data visualizations really are. Seemingly small alterations to a chart, such as the title or point of comparison, may drive the viewer to wildly different conclusions. What can you do to guide viewers towards a specific (positive or negative) conclusion? Can a graph be truly neutral?
This will leave you both with a better understanding of how graphs should be interpreted, as well as the ability to better convey the meaning of your data through visualization.
Marysia Winkels
2022-12-02T13:25:00+01:00
13:25
00:30
Auditorium
cfp-80-ai-for-good-then-and-now
https://eindhoven2022.pydata.org//cfp/talk/EDYJCV/
false
AI for Good - Then and Now
Talk
en
Using data in new and unexpected ways to solve real problems for real people - from farmers in Africa to refugees and the war in Ukraine
"AI? Back in my day, we just called it Data and Statistics. And it wasn't sexy at all. But we still did the same thing - we used data to improve people's decision making. To help improve people's lives."
As a Managing Data Scientist, Marijn Markus has over 6 years of experience in the field. In this talk, he will share his views and experience on Data Science: the vast differences between theory and practice, the dysfunctionality of organizations, and how data can be applied to change lives - from burnouts, refugees, farming, and fighting crime, to the conflict in Ukraine today.
/media/cfp/submissions/EDYJCV/Schermafbeelding_2022-11-15_115650_4pp02mn.png
Marijn Markus
2022-12-02T14:05:00+01:00
14:05
00:30
Auditorium
cfp-12-bulk-labelling-techniques
https://eindhoven2022.pydata.org//cfp/talk/BLQJCN/
false
Bulk Labelling Techniques
Talk
en
Let's say you've got some unlabelled data and you want to train a classifier. You need annotations before you can model, but because you're time-bound you must stay pragmatic. You only have an afternoon to spend. What would you do?
Let's say you've got some unlabelled data and you want to train a classifier. You need annotations before you can model, but because you're time-bound you must stay pragmatic. You only have an afternoon to spend. What would you do?
It turns out there are a few techniques that can totally help you with this. You can easily get an interesting subset annotated quickly by leveraging:
- a quick search engine
- pre-trained models
- sentence/image embeddings
- a trick to generate phrase embeddings
In this talk I will explain these techniques for bulk labelling, and I will also highlight some tools to get all of this to work. In particular you'll see:
- lunr.py (a lightweight search engine)
- sentimany (a library with pretrained sentiment models)
- embetter (adds pretrained embeddings for scikit-learn)
- umap (an amazing dimensionality reduction library)
- spaCy (a great NLP tool)
- sense2vec (phrase embeddings trained on reddit)
- bulk (a user interface for bulk labelling embeddings)
For this talk I'll assume you're familiar with scikit-learn and that you've heard of embeddings before.
Vincent Warmerdam
2022-12-02T15:10:00+01:00
15:10
00:30
Auditorium
cfp-46-predicting-cognitive-impairment-in-patients-with-a-primary-brain-tumor-a-machine-learning-perspective
https://eindhoven2022.pydata.org//cfp/talk/MM8CFA/
true
Predicting Cognitive Impairment in Patients With a Primary Brain Tumor: A Machine Learning Perspective
Talk
en
Cognitive impairment is common amongst patients with primary brain tumors (PBT). The exact mechanism by which primary brain tumors affect different cognitive functions, however, is not well understood. Cognitive impairment in PBT patients is likely the result of local effects of the tumor, global effects of the tumor, and patient characteristics. Finding predictors, or the potentially complex interactions between them, may improve our understanding of how different variables influence cognitive function. Moreover, this may facilitate personalized prediction of cognitive function to aid personalized treatment decisions. Several big challenges arise when aiming to make personalized predictions of cognitive functioning in PBT patients, and many of these problems likely generalize to other applied machine learning tasks.
In many prediction tasks, we encounter similar challenges: small sample sizes, weak and high-dimensional predictors, and lots of noisy, difficult-to-interpret outcome measures, to name a few. These are problems we need to solve without hurting explainability. In this talk, I plan to address several problems that we encountered while predicting cognitive functioning in primary brain tumor patients, and that we believe generalize well to many applied prediction tasks.
More specifically, I will dive into the following topics:
- Primary brain tumors, cognitive function, and treatment options
- Formulating our modeling problem and the challenges we often face
- Reducing the dimensionality of our noisy and high-dimensional output variable. What are our options?
- Current challenges when segmenting brain tumors, and the reasons deep learning models are still not good enough in practice
- Evaluating a large set of models to predict cognitive function without introducing bias, using Double Loop Cross Validation
- Using Multidimensional Scaling to obtain a low-dimensional representation of tumor location, and how representing data based on similarity may help create a more meaningful embedding space
Sander Boelders
2022-12-02T15:50:00+01:00
15:50
00:30
Auditorium
cfp-20-becoming-a-pokmon-master-with-dvc-reproducible-machine-learning-experiments
https://eindhoven2022.pydata.org//cfp/talk/NPZJMM/
false
Becoming a Pokémon Master with DVC: reproducible machine learning experiments
Talk
en
In machine learning projects we need to experiment in order to find and maintain the best-performing model. While we can do initial prototyping in a Notebook, eventually we need to move towards more structured experiment tracking to facilitate reproducibility of our experiments.
The open-source DVC library aims to tackle this problem through a Git-based approach to versioning data and artifacts. In this talk we will explore how DVC works, how we can apply it to conduct ML experiments, and how we can use it to become a great Pokémon trainer.
Every data scientist has at one point kept track of their experiments on paper, sticky notes, or in a spreadsheet. But how can we guarantee reproducibility for potentially thousands of experiments over numerous years? Can we figure out which version of a model ran in production six months ago, and what data went into its training?
The talk is aimed at data scientists and explores best practices for ML projects using a light-hearted topic. Some general knowledge of how ML works is helpful, but not necessary to understand the talk. The key concept is reproducibility: how can we track and version not just code, but entire experiments?
DVC is a potential solution for this. The philosophy behind it can be summarized as "Git for data and models". I will discuss its concepts and show how it works in practice for a classifier of Pokémon sprites.
The main takeaway will be the importance of reproducibility and a demo on how to achieve this.
Rob de Wit
2022-12-02T16:30:00+01:00
16:30
00:30
Auditorium
cfp-76-duckdb-bringing-analytical-sql-directly-to-your-python-shell-
https://eindhoven2022.pydata.org//cfp/talk/7SXUAZ/
false
DuckDB: Bringing analytical SQL directly to your Python shell.
Talk
en
In this talk, we will present DuckDB. DuckDB is a novel data management system that executes analytical SQL queries without requiring a server. DuckDB has a unique, in-depth integration with the existing PyData ecosystem. This integration allows DuckDB to query and output data from and to other Python libraries without copying it. This makes DuckDB an essential tool for the data scientist. In a live demo, we will showcase how DuckDB performs and integrates with the most used Python data-wrangling tool, Pandas.
The talk is catered primarily towards data scientists and data engineers. It aims to familiarize users with the design differences between Pandas and DuckDB and how to combine them to solve their data-science needs. We will give an overview of five main characteristics of DuckDB: 1) Vectorized Execution Engine, 2) End-to-End Query Optimization, 3) Automatic Parallelism, 4) Beyond-Memory Execution, and 5) Data Compression. In addition, users will also experience a live demo of DuckDB and Pandas in a typical data science scenario, focusing on comparing their performance and usability while showcasing their cooperation. The demo is most interesting for an audience familiar with Python, the Pandas API, and SQL.
Pedro Holanda
2022-12-02T09:50:00+01:00
09:50
00:30
Ernst-Curie
cfp-78-thompson-sampling-for-personalising-a-car-brands-advertisements
https://eindhoven2022.pydata.org//cfp/talk/CS3TAK/
false
Thompson sampling for personalising a car brand's advertisements
Talk
en
With targeted ads becoming more prevalent in the digital landscape, we share how we used **Thompson sampling** and a **Hierarchical Bayesian Algorithm** that makes its own decisions and serves the right ad to the right audience.
In this talk, we want to share how we used **Thompson sampling** and **self-learning models** to improve targeted advertising and create *modular ads that learn and adapt based on real-time data*. We want to share how we built a model, connected it to real-time data, and set it up so that it can change and improve during, rather than after, a campaign. We also discuss some challenges you might face when using these models and how you can overcome them.
Targeted ads are becoming more prevalent in the digital landscape. These ads can be less intrusive for consumers while at the same time helping businesses reach their preferred audience more effectively. In the current day and age, this also touches upon privacy and the soon-to-be cookieless era of the internet. So how does targeted advertising work with all these changes? Can we scale targeted ads, delivering them to a large audience while keeping them personal and relevant to the individual?
We made this possible by combining the knowledge of Data Scientists, Machine Learning Engineers, Marketing Specialists, Creative Developers, and Designers. Through this collaboration we built a **Hierarchical Bayesian Algorithm** that makes its own decisions and serves the right ad to the right audience.
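To give a rough feel for the Thompson-sampling idea underlying such a system, here is a generic Beta-Bernoulli bandit sketch (the click-through rates, variant count, and loop are illustrative assumptions, not the production model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two ad variants with (unknown to the model) true click-through rates.
true_ctr = [0.02, 0.10]

# Beta(1, 1) priors per variant: alpha counts clicks, beta counts non-clicks.
alpha = np.ones(2)
beta = np.ones(2)

for _ in range(5000):
    # Thompson sampling: draw a plausible CTR from each posterior
    # and show the ad whose sampled CTR is highest.
    sampled = rng.beta(alpha, beta)
    arm = int(np.argmax(sampled))
    click = rng.random() < true_ctr[arm]
    # Update the chosen arm's posterior with the observed outcome.
    alpha[arm] += click
    beta[arm] += 1 - click

# Impressions served per variant; traffic concentrates on the better ad.
shown = alpha + beta - 2
print(shown)
```

The appeal of this approach for campaigns is that exploration and exploitation happen continuously, so the model adapts during the campaign rather than after it.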
Nico van Engelenhoven, Julien Hamerlinck
2022-12-02T10:55:00+01:00
10:55
00:30
Ernst-Curie
cfp-49-using-deep-learning-to-reduce-flight-delays-at-schiphol-airport
https://eindhoven2022.pydata.org//cfp/talk/B8BYYJ/
true
Using Deep Learning to Reduce Flight Delays at Schiphol Airport
Talk
en
Inefficiencies in the flight preparation process (turnaround) account for around 30% of the total delays at Royal Schiphol Group (the Amsterdam airport). This process has been a black box, and for this reason it was quite hard to improve. To open the turnaround black box, Schiphol has developed technology based on computer vision using deep learning that detects many different turnaround-related tasks in real time from images streamed from cameras located at the aircraft ramps. In this session, we will explain how this project started, the technologies that we have applied, and the business impact generated by enabling the airport to reduce delays.
The Amsterdam airport Schiphol is one of the busiest airports in Europe. Schiphol hosts more than 120 airlines and is connected to 316 destinations around the globe. In this setup, it is key for Schiphol to keep a high on-time performance (the proportion of flights with no delays), not only to guarantee good service as an airport but also to contribute to the good service of the whole network of airports that Schiphol is connected with.
Schiphol has identified that 30% of the total delays are caused by inefficiencies related to the flight preparation process (turnaround) such as fuelling, passenger boarding, baggage handling, etc. To improve the turnaround process, many different tasks should be measured and monitored to find inefficiencies that can be improved.
In the past, to have reliable information about all relevant turnaround-related tasks, many sensors in vehicles and airplanes needed to be installed and maintained. However, given the number of vehicles involved and many airlines, gathering information and maintaining thousands of sensors was a monumental task making this approach financially unviable.
Today, to open the turnaround black box, Royal Schiphol Group has developed a technology that uses deep learning and computer vision to make real-time detections of more than 33 different turnaround-related tasks from images streamed from cameras located at the aircraft ramps (parking spots). From these detections, Schiphol has developed actionable insights to avoid delays by using real-time alert systems and has gained a deep understanding of the inefficiencies of the turnaround process by analyzing historical data.
In this talk, we will introduce the technologies that we have applied (such as data centric AI), learnings, challenges, and the business impact.
/media/cfp/submissions/B8BYYJ/E7AC3BFB-B49C-4510-9262-E4BF0DDFE88D_UOUOP8V.jpeg
Santiago Ruiz, Tosca van Meer
2022-12-02T11:35:00+01:00
11:35
00:30
Ernst-Curie
cfp-55-lowering-the-barrier-for-ml-monitoring
https://eindhoven2022.pydata.org//cfp/talk/QAAY3X/
false
Lowering the barrier for ML monitoring
Talk
en
Building and fine-tuning models is exciting, but how do you know your model keeps performing in the way you carefully designed it? Bringing your model to production without adding any monitoring is like flying on autopilot, but blindfolded.
Adding a mature monitoring setup to your model deployments can be a daunting task that is often pushed to the bottom of the to-do list, or put off entirely. How can we, Data Scientists and ML Engineers, introduce monitoring earlier in the MLOps process and make it part of your deployment right from the start? This talk offers a practical setup to implement ML monitoring in your project using Prometheus and other open-source tools.
The ML ecosystem focuses a lot on getting models to production. However, that should not be the end goal, it’s merely the beginning of extracting real value from your model. During this talk, we will discuss:
- Why monitoring your ML model is important
- How traditional software monitoring can be used for ML systems
- What additional elements are required for ML systems
- How to recognise data drift and target drift
- Which tools are promising for ML monitoring
- A scenario for a minimal monitoring setup using open-source tools
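As a flavour of the data-drift point in the list above, one of the simplest checks can be sketched with a two-sample Kolmogorov-Smirnov test (an illustrative example with synthetic data, not the tooling covered in the talk):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference data seen at training time, plus two production batches:
# one from the same distribution, one whose mean has shifted.
reference = rng.normal(loc=0.0, scale=1.0, size=2000)
stable_batch = rng.normal(loc=0.0, scale=1.0, size=500)
drifted_batch = rng.normal(loc=0.8, scale=1.0, size=500)

def has_drifted(ref, batch, alpha=0.01):
    """Flag drift when the KS test rejects 'same distribution'."""
    return bool(ks_2samp(ref, batch).pvalue < alpha)

print(has_drifted(reference, stable_batch))
print(has_drifted(reference, drifted_batch))
```

A per-feature statistic like this is easy to export as a Prometheus gauge, which is how a minimal setup can reuse traditional software-monitoring infrastructure for ML.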
/media/cfp/submissions/QAAY3X/Xccelerated_Wesley_cropped_wzjb5Jf.jpeg
Wesley Boelrijk
2022-12-02T14:05:00+01:00
14:05
00:30
Ernst-Curie
cfp-51-how-to-create-a-devcontainer-for-your-python-project-
https://eindhoven2022.pydata.org//cfp/talk/YPULZZ/
false
How to create a Devcontainer for your Python project 🐳
Talk
en
Devcontainers are an open-source [specification](https://containers.dev/), which allow you to connect your IDE to a running Docker container and develop right inside it. This has numerous advantages. Because the dev environment is now formally defined, it is _reproducible_. This means others can easily _reproduce_ your dev environment, too! This makes it much easier for others to join in on your project, and stay updated with changes to the environment.
In this talk, you will learn: why you might want to use a Devcontainer for your project (or when not 😉), what exactly a Devcontainer is, and how you can build one for your Python project 🐍.
Devcontainers have been gaining traction lately. Whereas previously the technology existed only under the umbrella of Visual Studio Code, it is now released as an [open specification](https://containers.dev/). As such, multiple IDEs can all use the same standard specification, promoting reusability and standardisation. Developers are currently hard at work pushing the technology towards standardisation. For these reasons, this is an exciting time to take a closer look at this new specification, and at what the technology can do for us in general.
So how will I go about this talk? Let's take a look 🙌🏻.
## 📝 Talk setup
Let's learn about Devcontainers together. This will be the setup of my talk:
1. Why Devcontainers? What problem do they aim to solve? Pros & cons.
1. Building a basic Devcontainer from scratch
1. Opening up the Devcontainer
1. Extending the Devcontainer with more useful features
- Custom VSCode settings
- Running your CI task in the Devcontainer
- Connecting as a non-root user
- Opening up a port to the Devcontainer
1. Going further 🔮
- More useful links & resources
1. Concluding ✓
## 🏡 What you will take home
At the end of the talk, you will be taking home the following:
- When it makes sense to create a Devcontainer
- How you can create one
- Knowledge of how Devcontainers work
- A template repo for a Python project Devcontainer
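To give a first impression of what such a definition looks like, here is a minimal hypothetical `.devcontainer/devcontainer.json` for a Python project (the image tag, extension, and commands are illustrative assumptions, not the talk's template repo):

```json
{
    "name": "my-python-project",
    "image": "mcr.microsoft.com/devcontainers/python:3.11",
    "customizations": {
        "vscode": {
            "extensions": ["ms-python.python"]
        }
    },
    "postCreateCommand": "pip install -r requirements.txt",
    "remoteUser": "vscode"
}
```

Because this file lives in the repository, anyone opening the project gets the same interpreter, tools, and editor configuration, which is the reproducibility argument from the abstract.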
## ❤️ Open Source Software
Devcontainers are completely open-source. Both the [specification](https://containers.dev/) and the implementations are open-source. The editor that has support for this spec is VSCode, which is also open source.
## 🎒 Pre-requisites
No need to pack anything extra in your bag of knowledge. This talk will not assume you have any existing knowledge on Devcontainers.
## 👥 About the speaker
The speaker has worked with Devcontainers for many months, introducing and implementing them for both existing and new projects at various companies. Having followed the technology for a while, the speaker has seen how it has changed and where it is going, and has promoted Devcontainers at knowledge exchanges, on a blog, and in talks.
/media/cfp/submissions/YPULZZ/without-ship-taller-img_kTPauZy.png
Jeroen Overschie
2022-12-02T15:10:00+01:00
15:10
00:30
Ernst-Curie
cfp-64-how-to-not-pull-your-hair-out-while-providing-data-to-the-business-unit-testing-for-your-data-pipelines
https://eindhoven2022.pydata.org//cfp/talk/HVQZR7/
false
How to not pull your hair out while providing data to the business: unit testing for your data pipelines
Talk
en
Having other people use your datasets is nice, but updating the logic behind them could break dashboards and ML models down the line. In this talk I will explain how to prevent these stressful situations by applying unit testing to your data or preprocessing pipelines in Python.
Unit testing is a testing method that is often used when building software. It helps in building more robust applications, so as a developer you can release with more confidence and fewer errors in production. But can you use this technique as a data engineer or data scientist for your preprocessing or pipeline code as well?
It turns out you can! And you can reap the same benefits that software engineers experience, such as extra confidence before deployment, in your day-to-day work!
In this talk I will walk you through the conceptual idea of unit testing for data or preprocessing pipelines and provide an example on how to apply it to a very simple use case that uses Pandas. The example will test some transformations on beer data 😉.
I will walk through a five step process with code examples in Python. That way data scientists and data engineers have practical guidance on how to apply it to their own projects by showing how to:
- Define what the logic is that you want to test within your project;
- Refactor your code to separate out the specific parts you want to test;
- Generate synthetic data for testing purposes;
- Put your refactored code under tests using Pytest;
- Set up a CI pipeline in your Git provider in order to test code automatically.
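The steps above can be sketched in miniature (a made-up transformation and test, not the beer-data example from the talk):

```python
import pandas as pd

def add_price_with_tax(df: pd.DataFrame, tax_rate: float = 0.21) -> pd.DataFrame:
    """Transformation under test: add a tax-inclusive price column."""
    out = df.copy()  # avoid mutating the caller's DataFrame
    out["price_with_tax"] = out["price"] * (1 + tax_rate)
    return out

def test_add_price_with_tax():
    # Synthetic data: small and hand-checkable.
    df = pd.DataFrame({"name": ["ipa", "stout"], "price": [2.0, 3.0]})
    result = add_price_with_tax(df, tax_rate=0.10)
    # The input frame is not mutated.
    assert "price_with_tax" not in df.columns
    # Values are computed as expected.
    assert result["price_with_tax"].round(2).tolist() == [2.2, 3.3]

test_add_price_with_tax()
```

Under Pytest, the `test_` function is discovered and run automatically, so the same file slots straight into a CI pipeline.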
Lars Hanegraaf
2022-12-02T15:50:00+01:00
15:50
00:30
Ernst-Curie
cfp-34-causal-inference-and-scenario-generation-within-just-eat-takeaway-com
https://eindhoven2022.pydata.org//cfp/talk/Z9CGJT/
false
Causal inference and scenario generation within Just Eat Takeaway.com
Talk
en
When A/B testing is not possible but we are still interested in drawing causal conclusions from our data, we need to resort to quasi-experimental approaches. This is the landscape that Just Eat Takeaway.com navigates: we often have experimental data about a specific city, and are interested in knowing what the effects would be on another city. When we drop the requirement of causality and are merely interested in generating likely scenarios, we can use the power of predictive modelling to our advantage. From predicting likely future scenarios to generating synthetic order data on a minute-to-minute basis, all is possible using the right statistical tools. Even in the absence of pure experimental data, we are still able to model likely futures. This talk is relevant for data scientists interested in the intersection of statistics and predictive modelling; some basic knowledge of these topics will be assumed. The first half of the presentation (0-15 minutes) will cover quasi-experimental models, and the second half (15-30 minutes) scenario and data generation.
Within Just Eat Takeaway.com we are often interested in knowing the causal effect of a certain treatment, such as a price change or a marketing campaign, on the predicted order volume. However, in the absence of a pure A/B test, we need to be smart. The first half of the presentation (0-15 minutes) will delve deeper into the problem of causal modelling for quasi-experiments. In particular, we will go over some statistical models that are able to estimate the counterfactuals. A counterfactual can be seen as a what-if: What would have happened if we did (or did not) deploy a certain treatment? Some techniques that are used within Just Eat Takeaway.com to solve this problem are difference in difference and synthetic control. We will delve a little deeper into these techniques and show the audience how they can be used to answer our question.
The second half of the presentation (15-30) will be concerned with scenario generation. If we drop the requirement of causality, and are merely interested in generating scenarios that are historically correlated with our treatment, we can use advanced predictive models to generate several futures for multiple values of our treatment. For example, we could predict several futures of order volume by considering several price changes over time. I will briefly mention a recent promising model that is able to deal with such multivariate time series called the Temporal Fusion Transformer. I will end the presentation with models for synthetic data generation, called Gaussian copulas, that are able to generate realistic order data on a minute to minute basis given the possible futures that we predict. This data can be used to predict how many couriers we would need in a city to successfully fulfil the demand and to estimate what possible hiccups there could be.
The takeaway of the talk is that even in the absence of pure experimental data, we are still able to model likely futures and to act upon this.
Max Knobbout
2022-12-02T16:30:00+01:00
16:30
00:30
Ernst-Curie
cfp-42-everything-in-its-right-place-optimising-ranking-in-online-grocery
https://eindhoven2022.pydata.org//cfp/talk/NMZHHQ/
false
Everything in its Right Place: Optimising Ranking in Online Grocery
Talk
en
An ever-increasing number of people are discovering mobile grocery shopping as an alternative to brick-and-mortar supermarkets. This talk will cover how we can use machine learning to make these customers' grocery shopping as smooth and frictionless as possible. We do this by applying ML models that rank products in agreement with the customer's intent: e.g., by detecting personal shopping habits, and by striking a balance between query relevance and margin.
In online grocery, the wide range of available choices can easily overwhelm a customer. Moreover, failure to find the desired products may lead to customers not converting at all. It’s therefore crucial to optimise ranking, in accordance with the customer’s intent; and to construct sensible algorithms that capture this intended behaviour.
In this talk, I will provide a holistic view of how we approach ranking in the online grocery context. Depending on an app page’s intended functionality, we might aim to make rebuying as frictionless as possible, while elsewhere we personalise search query relevance while not losing sight of margin. More concretely, I will discuss how we have set up ranking in an explainable and interpretable way that allows for a balance between relevance, profit and any other business-based concerns there might be. In addition, I will briefly discuss three algorithms that we have developed and implemented, and how these are combined to optimise the customer experience:
- prediction of rebuying probabilities through detecting personal shopping habits
- construction of unbiased search term-article relevances through structural position bias corrections
- personalisation of search results while taking profitability into account
This talk will provide the application-minded Data Scientist with an inside view into the deliberations that inform our ranking algorithms and setup.
Bas Vlaming
2022-12-02T09:50:00+01:00
09:50
00:30
Planck
cfp-14-predictive-maintenance-at-asml
https://eindhoven2022.pydata.org//cfp/talk/C9TDVU/
false
Predictive Maintenance at ASML
Talk
en
In the chip industry, time is money. Customers of ASML’s lithography systems expect high uptimes. But expected and unexpected maintenance is part of that equation, sometimes requiring to halt the production temporarily.
In this presentation, we show you how we are building and deploying Machine Learning models to predict upcoming maintenance actions within the upcoming three months. Our work helps to boost productivity, maximize system utilization and reduce unexpected workload for ASML’s customer support.
In the chip industry, time is money. Customers of ASML’s lithography systems expect high uptimes. But expected and unexpected maintenance is part of that equation, sometimes requiring to halt the production temporarily.
In this presentation, we show you how we are building and deploying Machine Learning models to predict upcoming maintenance actions within the upcoming three months. Our work helps to boost productivity, maximize system utilization and reduce unexpected workload for ASML’s customer support.
Anjan Prasad Gantapara, Hamideh Rostami
2022-12-02T10:55:00+01:00
10:55
00:30
Planck
cfp-25-fuzzytm-a-python-package-for-fuzzy-topic-models
https://eindhoven2022.pydata.org//cfp/talk/CDE9AL/
false
FuzzyTM: a Python package for fuzzy topic models
Talk
en
We present FuzzyTM, a Python library for training fuzzy topic models and creating topic embeddings for downstream tasks. Its modular design allows researchers to modify each software element and for future methods to be added. Meanwhile, the user-friendly pipelines with default values allow practitioners to train a topic model with minimal effort.
The volume of data created worldwide is growing exponentially and is forecast to reach 181 zettabytes by 2025. Approximately 80% of today’s data is unstructured or semi-structured. Analyzing all this data is time-intensive and costly in many cases. One technique to systematically analyze large corpora of text is topic modeling, which returns the latent topics present in a corpus. Recently, several fuzzy topic modeling algorithms have been proposed and have shown superior results over existing algorithms. Although various Python libraries offer topic modeling algorithms, none includes fuzzy topic models. Therefore, we present FuzzyTM, a Python library for training fuzzy topic models and creating topic embeddings for downstream tasks. Its modular design allows researchers to modify each software element and for future methods to be added. Meanwhile, the user-friendly pipelines with default values allow practitioners to train a topic model with minimal effort.
Emil Rijcken
2022-12-02T11:35:00+01:00
11:35
00:30
Planck
cfp-74-turning-your-data-ai-algorithms-into-full-web-apps-in-no-time-with-taipy
https://eindhoven2022.pydata.org//cfp/talk/93XVSZ/
false
Turning your Data/AI algorithms into full web apps in no time with Taipy
Talk
en
In the Python open-source ecosystem, many packages are available that cater to:
- the building of great algorithms
- the visualization of data
- back-end functions
Despite this, over 85% of data science pilots remain pilots and never make it to the production stage.
With Taipy, Data Scientists/Python Developers will be able to build great pilots as well as stunning production-ready applications for end-users.
Taipy provides two independent modules: *Taipy GUI* and *Taipy Core*.
In this talk, we will demonstrate how:
- *Taipy GUI* goes way beyond the capabilities of the standard graphical stack: Streamlit, Dash, etc.
- *Taipy Core* is simpler yet more powerful than the standard Python back-end stack: Airflow, MLFlow, etc.
Initially, we will present how a complete graphical interface can be programmed using *Taipy GUI's* low-code syntax (in Python).
We will then introduce *Taipy Core's* main concepts: pipelines, scenarios, data nodes, tasks, caching, etc. We will create pipelines graphically from Python IDEs and submit these pipelines for execution. These pipelines will then be "scenario enabled" to provide powerful what-if analysis for Data Scientists or End-Users.
Finally, complete Python applications integrating *Taipy Core* and *Taipy GUI* will be demonstrated.
/media/cfp/submissions/93XVSZ/taipy_logo_500x500px_Ke8d6Al.png
Vincent Gosselin
2022-12-02T14:05:00+01:00
14:05
00:30
Planck
cfp-79-significant-roadblocks-to-usefulness-for-jupyter-notebooks-and-a-recipe-to-fix-them
https://eindhoven2022.pydata.org//cfp/talk/78FS8J/
false
Significant Roadblocks to Usefulness for Jupyter Notebooks and a Recipe to Fix them
Talk
en
The most popular data science development tools have largely been developed by academics as scratch pads for interactive data exploration. Jupyter notebooks, for instance, were developed 20 years ago at Berkeley (they were called IPython notebooks at the time). Because of their flexibility and interactivity, these tools have become widespread amongst coding data scientists.
More recently, GUI-based tools have begun to be popular. They reduce the technical load on the user, but typically lack much-needed flexibility and interoperability. Both avenues of innovation are wildly inadequate for modern data science development. GUI-based tools are typically too expensive, too restrictive, and too closed. The development of automated machine learning tools only made this problem worse, with dozens of software startups urging business analysts to start building machine learning solutions, often with questionable results and even more questionable customer retention metrics.
On the other hand, notebook-based solutions are typically too error-prone, too loose, and too isolated to be sufficient. The result is intractable challenges around collaboration, communication, and deployment. The most recent entrants into the notebook space have only marginally improved the experience without fixing the underlying flaws.
This talk discusses the fundamental flaws in the way these tools have been developed and how they currently function. Advancement in this space will require reworking the architecture and functionality of these tools at some of the most basic levels. These fixes include things like multiprocessing capabilities; real-time collaboration tools; safe, consistent code execution; easy API deployment; and portable communication tools. Future innovation in the data science development experience will have to tackle these problems and more in order to be successful.
Greg Michaelson
2022-12-02T15:10:00+01:00
15:10
00:30
Planck
cfp-58-practical-code-archaeology
https://eindhoven2022.pydata.org//cfp/talk/NJVGKJ/
false
Practical code archaeology
Talk
en
Code archaeology is figuring out what a thing is for, who built it, and how you can get it to run again.
Dealing with legacy code artefacts (while under time pressure) is something we data people encounter a lot in daily life. I will share my experiences from both a research and a software engineering standpoint. After quickly going over some common-sense approaches, I will dive deeper into real-world archaeology and digital forensics, and find out what we can learn from these fields to make dealing with old artefacts a bit easier. Expect a mix of code and non-code hacks, with ample pop culture archaeology memes.
Contents:
- Code archaeology: why do we do it, and do we need to bring a hat?
- The basics: common-sense approaches to code archaeology
- What can we learn from real-world archaeologists?
- What can we learn from digital forensics?
/media/cfp/submissions/NJVGKJ/thisyearsdatascientists_eIrDokn.png
Judith van Stegeren
2022-12-02T15:50:00+01:00
15:50
00:30
Planck
cfp-61-why-does-everyone-need-to-develop-a-machine-learning-package-
https://eindhoven2022.pydata.org//cfp/talk/KVMZYT/
false
Why does everyone need to develop a machine learning package?
Talk
en
This talk is about machine learning package development. I will speak about the pains and benefits it brings developers and share why open-sourcing makes a package even better. The talk is not focused on the package itself but rather on common problems, so it will be interesting to a wide range of data scientists and Python developers.
Have you ever wondered how new open source packages emerge? This talk is exactly about that. I will describe how the idea for a package is born and how it transforms from a proof of concept into a first release version, how a business can benefit from it, and why the development itself is not the hardest part of open source work. And last but not least, why we changed the architecture of our package three times, and why you would too!
Andrei Alekseev
2022-12-02T16:30:00+01:00
16:30
00:30
Planck
cfp-81-come-connect-to-the-active-brainport-community
https://eindhoven2022.pydata.org//cfp/talk/L8YEWJ/
false
Come connect to the active Brainport community
Talk
en
PyData provides a forum for an international community of users and developers to share ideas and learn from each other. So let’s connect, come to this interactive session to meet people from other cultures, with new questions and fresh perspectives.
Sharing knowledge and experience with others is not only rewarding but also improves your professional skills. CodeMasters offers skills of the future to those who need them most. During a 10-week program, participants are supported in learning how to code and growing towards a career in the Netherlands.
Yannic Suurmeijer