Company
Date Published
Author
Chris Mauck, Jonas Mueller
Word count
478
Language
English
Hacker News points
None

Summary

The Office-Home dataset, a widely used computer vision benchmark, contains hundreds of label errors and other data issues, including mislabeled examples, ambiguous examples, and outliers, all of which can harm modeling and analytics efforts. These errors were discovered using Cleanlab Studio, an automated solution that uses AI to identify and fix data issues. The dataset was curated by collecting images with a web crawler and then filtering them to ensure the desired object appeared in each picture, a process that often produces incorrect image-label pairs. Running the dataset through Cleanlab Studio lets researchers identify and correct these errors, which can improve the accuracy of their models and the reliability of their conclusions.
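
To make the idea of flagging likely mislabeled examples more concrete, here is a minimal sketch in plain Python. It is not Cleanlab Studio's method, which the summary does not describe. It assumes you already have per-class predicted probabilities from some trained classifier, and it applies a simple per-class confidence threshold. The function names, the toy labels, and the toy probabilities are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def class_thresholds(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]], num_classes: int
) -> List[float]:
    """For each class k, average the predicted probability of k over examples labeled k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled examples gets threshold 1.0 (an assumption of this sketch).
    return [sums[k] / counts[k] if counts[k] else 1.0 for k in range(num_classes)]


def find_label_issues(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]]
) -> List[int]:
    """Return indices whose given label disagrees with a confidently predicted class."""
    num_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability reaches that class's own threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if confident:
            guess = max(confident, key=lambda k: p[k])
            if guess != y:
                issues.append(i)
    return issues


# Toy data (hypothetical): two classes, six images, one suspicious label at index 2.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly prefers class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
print(find_label_issues(labels, pred_probs))
```

On a real dataset the probabilities should come from held-out (for example, cross-validated) predictions, so that the model has not simply memorized the noisy labels it is being asked to audit. Flagged indices are candidates for human review rather than confirmed errors.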