PyData Eindhoven 2022

Bulk Labelling Techniques
12-02, 14:05–14:35 (Europe/Amsterdam), Auditorium

Let's say you've got some unlabelled data and you want to train a classifier. You need annotations before you can model, but because you're time-bound you must stay pragmatic. You only have an afternoon to spend. What would you do?


It turns out there are a few techniques that can totally help you with this. You can get an interesting subset annotated quickly by leveraging:

  • a quick search engine (a short sketch follows this list)
  • pre-trained models
  • sentence/image embeddings
  • a trick to generate phrase embeddings
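
To give a rough idea of the first point, here's a minimal sketch of using a lightweight search engine to pull out a candidate set for labelling. It uses lunr.py, one of the tools listed further down; the example texts, the "refund" query, and the id/text field names are made up for illustration.

    from lunr import lunr

    # A handful of made-up unlabelled texts; in practice this is your dataset.
    texts = [
        "i want my money back",
        "please refund my last order",
        "great service, thanks!",
    ]
    documents = [{"id": i, "text": t} for i, t in enumerate(texts)]

    # Build an in-memory index over the "text" field and query it.
    index = lunr(ref="id", fields=("text",), documents=documents)
    hits = index.search("refund")

    # Every hit is a candidate for the same label; skimming this small
    # subset is much faster than reading the whole dataset.
    candidates = [texts[int(hit["ref"])] for hit in hits]
    print(candidates)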

In this talk I will explain these techniques for bulk labelling, while also highlighting some tools to get all of this to work. In particular you'll see:

  • lunr.py (a lightweight search engine)
  • sentimany (a library with pretrained sentiment models)
  • embetter (adds pretrained embeddings for scikit-learn)
  • umap (an amazing dimensionality reduction library)
  • spaCy (a great NLP tool)
  • sense2vec (phrase embeddings trained on reddit)
  • bulk (a user interface for bulk labelling embeddings; a sketch tying several of these together follows this list)
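
To show how a few of these tools slot together, below is a minimal sketch that embeds raw text, projects it to 2D with umap, and writes a csv that bulk can plot. The unlabelled.csv input file, its text column, and the all-MiniLM-L6-v2 model name are assumptions here; check the embetter and bulk documentation for the exact column names and command-line usage.

    import pandas as pd
    from umap import UMAP
    from sklearn.pipeline import make_pipeline
    from embetter.text import SentenceEncoder

    # Hypothetical input: a csv with one column of raw, unlabelled text.
    texts = pd.read_csv("unlabelled.csv")["text"].tolist()

    # Turn the sentences into embeddings, then squash them down to 2D.
    pipe = make_pipeline(SentenceEncoder("all-MiniLM-L6-v2"), UMAP(n_components=2))
    coords = pipe.fit_transform(texts)

    # bulk reads a csv with x/y coordinates next to the text to annotate.
    pd.DataFrame({"x": coords[:, 0], "y": coords[:, 1], "text": texts}).to_csv(
        "ready.csv", index=False
    )
    # Then, from the command line:  python -m bulk text ready.csv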

For this talk I'll assume you're familiar with scikit-learn and that you've heard of embeddings before.


Prior Knowledge Expected

Previous knowledge expected

Vincent is a senior data professional who has worked as an engineer, researcher, team lead, and educator. He's especially interested in understanding algorithmic systems so that failures can be prevented. As such, he prefers simpler solutions that scale over the latest and greatest from the hype cycle.

You may know him from his koaning.io blog, his many open source projects, some of his PyData talks, or his calmcode.io project.

He's also known for giving out helpful advice for free, so don't hesitate to talk to him if you have an interesting data problem.