
The NeurIPS 2024 Preshow: Are We Measuring What We Think We Are? The Perils of Contaminated Benchmark Datasets

Blog post from Voxel51

Post Details
Company: Voxel51
Author: Harpreet Sahota
Word Count: 1,682
Language: English
Summary

The paper addresses a significant problem in machine learning research: benchmark datasets are often contaminated with errors, which leads to overestimated model performance and hinders scientific progress. The authors propose SELFCLEAN, a data cleaning method that uses self-supervised learning (SSL) to identify and mitigate data quality issues in benchmark datasets. SELFCLEAN follows a two-step process: it first learns representations of the data with SSL, then applies distance-based indicators to those representations to flag potential data quality issues. The method offers two operating modes, fully automated and human-in-the-loop, so users can choose between automatic cleaning and manual verification. Experiments demonstrate that SELFCLEAN is effective at detecting off-topic samples, near duplicates, and label errors, which underscores its practical value for accurate model evaluation and for restoring confidence in benchmark results.
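
To make the second step more concrete, here is a minimal sketch of how distance-based indicators could rank candidate issues once embeddings exist. It is not the SELFCLEAN implementation: the function names, the toy 2-D "embeddings", the k-nearest-neighbor isolation score for off-topic samples, and the same-class versus other-class distance ratio for label errors are all illustrative assumptions, and the SSL step is replaced by hand-written vectors. It also assumes every class has at least two samples.

```python
import numpy as np

def cosine_distance_matrix(embeddings):
    """Pairwise cosine distances between L2-normalized rows."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - x @ x.T

def rank_near_duplicates(dist):
    """Index pairs (i, j) with i < j, closest pairs first."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j], kind="stable")
    return list(zip(i[order].tolist(), j[order].tolist()))

def rank_off_topic(dist, k=2):
    """Sample indices, most isolated first (mean distance to k nearest neighbors)."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # ignore self-distance
    knn = np.sort(d, axis=1)[:, :k]
    return np.argsort(-knn.mean(axis=1), kind="stable").tolist()

def rank_label_errors(dist, labels):
    """Sample indices, most suspicious label first.

    Score is near 1 when the closest other-class sample is much nearer
    than the closest same-class sample. Assumes each class has >= 2 samples.
    """
    labels = np.asarray(labels)
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    same = labels[:, None] == labels[None, :]
    nearest_same = np.where(same, d, np.inf).min(axis=1)
    nearest_other = np.where(~same, d, np.inf).min(axis=1)
    score = nearest_same / (nearest_same + nearest_other)
    return np.argsort(-score, kind="stable").tolist()

# Toy stand-in for SSL embeddings (hypothetical data, not from the paper).
embeddings = [
    [1.0, 0.0],     # 0: class 0
    [0.999, 0.01],  # 1: class 0, almost identical to sample 0
    [0.9, 0.3],     # 2: class 0
    [0.0, 1.0],     # 3: class 1
    [0.1, 1.0],     # 4: class 1
    [0.95, 0.2],    # 5: labeled class 1 but sits among class 0
    [-1.0, -1.0],   # 6: far from everything else
]
labels = [0, 0, 0, 1, 1, 1, 0]

dist = cosine_distance_matrix(embeddings)
print("top near-duplicate pair:", rank_near_duplicates(dist)[0])
print("most isolated sample:", rank_off_topic(dist)[0])
print("most suspicious label:", rank_label_errors(dist, labels)[0])
```

Because each function returns a ranking and not a yes/no decision, the same output could serve both operating modes described above: a reviewer could inspect the top of each list by hand, or a threshold on the underlying scores could remove samples automatically. How SELFCLEAN itself sets such thresholds is not covered in this summary.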