
🔭 Improving Your ML Datasets, Part 2: NER

Blog post from Galileo

Post Details
Company: Galileo
Author: Ben Epstein
Word Count: 1,356
Language: English
Summary

The blog post takes a data-centric approach, using Galileo to improve a Named Entity Recognition (NER) system trained on the MIT Movies dataset. By inspecting and analyzing errors in the training data, the authors uncovered three kinds of issues: mislabeled spans, incorrect span boundaries, and semantic overlap between classes. They filtered spans by high DEP (Data Error Potential) score, relabeled the affected samples, and applied targeted filters to the most challenging classes, such as Genre and Actor. After these corrections, the F1 score on the test data improved by 3.3 points, with most of the gain coming from correcting just 4% of the training data. The authors present this as evidence that Galileo's workflow can save model iterations, GPU costs, and training time while improving model performance in production.
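
The filtering step in the summary can be pictured with a small, library-free sketch. It does not use Galileo's API. The `Span` record, the `dep` field, the 0.8 threshold, and the sample rows are all hypothetical, chosen only to show the idea of surfacing high-score spans in selected classes for manual relabeling.

```python
from dataclasses import dataclass


@dataclass
class Span:
    sample_id: int
    text: str
    label: str
    dep: float  # hypothetical per-span error-potential score in [0, 1]


def spans_to_review(spans, threshold=0.8, labels=None):
    """Return spans scoring at least `threshold`, highest score first.

    If `labels` is given, only spans with one of those labels are kept.
    """
    picked = [
        s for s in spans
        if s.dep >= threshold and (labels is None or s.label in labels)
    ]
    return sorted(picked, key=lambda s: s.dep, reverse=True)


def review_fraction(spans, flagged):
    """Share of all spans that were flagged for review."""
    return len(flagged) / len(spans) if spans else 0.0


# Made-up rows for illustration only; not taken from the MIT Movies dataset.
spans = [
    Span(0, "tom hanks", "ACTOR", 0.12),
    Span(1, "thriller", "GENRE", 0.91),
    Span(2, "steven spielberg", "ACTOR", 0.85),  # likely mislabel
    Span(3, "1994", "YEAR", 0.05),
]

flagged = spans_to_review(spans, threshold=0.8, labels={"GENRE", "ACTOR"})
for s in flagged:
    print(s.sample_id, s.label, s.text, s.dep)
```

Sorting by score puts the most suspicious spans first, so a reviewer can stop once the remaining candidates look clean. `review_fraction` gives a quick sense of how much of the data the chosen threshold asks a human to inspect.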