Company
Date Published
Author
Jonas Mueller, Mayank Kumar, Hui Wen Goh, Hang Zhou
Word count
1108
Language
English
Hacker News points
2

Summary

Our new algorithm detects errors in numerical datasets by fitting a regression model that predicts the values in one column from the other columns in the dataset. It scores each datapoint by how likely its value is correct and also estimates how many datapoints' values were corrupted, so that the most likely corrupted datapoints can be prioritized for review. The approach works with any regression model that supports a standard fit() and predict() interface, which makes it applicable to many types of data, including image, text, and audio data with associated numerical outcomes. The algorithm uses the regression model to estimate both aleatoric and epistemic uncertainty, and it accounts for these uncertainty sources when producing quality scores, which reduces the chance that incorrect datapoints are misclassified. By sorting datapoints by these scores and fitting multiple copies of the regression model with bootstrap resampling, we can estimate the fraction of corrupted datapoints in the dataset. In benchmarks on five real numerical datasets with naturally occurring errors, the algorithm showed significant improvements over alternative approaches such as conformal inference and RANSAC. The cleanlab library provides simple Python code to run this algorithm on your data, making it easily accessible for automatic validation of your own datasets.
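
To make the scoring idea concrete, here is a minimal sketch of an uncertainty-scaled residual score. It is not cleanlab's implementation, and every name in it (LinearModel, quality_scores, n_boot, n_folds) is hypothetical. It assumes out-of-sample predictions come from K-fold cross-validation, that epistemic uncertainty is the spread of predictions across bootstrap-resampled fits, and that aleatoric uncertainty can be simplified to a single constant noise level (the median absolute residual). The summary above does not specify any of these details.

```python
import numpy as np


class LinearModel:
    """Tiny least-squares regressor; a stand-in for any model with fit()/predict()."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])  # add intercept column
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


def quality_scores(make_model, X, y, n_boot=20, n_folds=5, seed=0):
    """Return one score in [0, 1] per datapoint; lower means more likely corrupted.

    Hypothetical scoring rule: exp(-|residual| / (epistemic + aleatoric)).
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    # Out-of-sample predictions via K-fold cross-validation (an assumption).
    oos = np.empty(n)
    for held_out in np.array_split(rng.permutation(n), n_folds):
        train = np.setdiff1d(np.arange(n), held_out)
        oos[held_out] = make_model().fit(X[train], y[train]).predict(X[held_out])
    residual = np.abs(y - oos)

    # Epistemic uncertainty: std of predictions across bootstrap-resampled fits.
    boot_preds = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot_preds[b] = make_model().fit(X[idx], y[idx]).predict(X)
    epistemic = boot_preds.std(axis=0)

    # Aleatoric uncertainty: simplified here to one constant noise level.
    aleatoric = np.median(residual)

    return np.exp(-residual / (epistemic + aleatoric + 1e-12))
```

Sorting datapoints by ascending score then puts the most suspicious values first for review. Any other object with fit() and predict() methods could be passed in place of LinearModel, as long as calling make_model() returns a fresh, unfitted instance.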