Detecting Dataset Drift and Non-IID Sampling: A k-Nearest Neighbors approach that works for Image/Text/Audio/Numeric Data

Company

Cleanlab

Date Published

May 30, 2023

Author

Jesse Cummings, Elías Snorrason, Jonas Mueller

Word count

2203

Language

English

Hacker News points

URL

cleanlab.ai/blog/non-iid-detection

Summary

The post examines the assumption that data are Independent and Identically Distributed (IID), which underlies most machine learning, data science, and statistics/analytics work. It argues that most datasets violate this assumption because of data drift, non-IID sampling, or a lack of independence between datapoints. The authors present a k-Nearest Neighbors (kNN) based method for detecting when a dataset is not IID. The method builds a kNN graph of the dataset from feature values, then applies a two-sample permutation test with the Kolmogorov-Smirnov statistic to decide whether the distribution of index distances (how far apart two datapoints sit in the dataset's ordering) between kNN neighbors differs significantly from the distribution between arbitrary pairs of datapoints. The post walks through several examples, including image datasets with concept drift and extreme drift, in which cleanlab's non-IID check flags the problem. It also covers scoring individual datapoints and handling data that are identically distributed but not independent.
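
The test described in the summary can be illustrated with a minimal sketch. This is not cleanlab's implementation: all function names and parameters below are hypothetical, the code uses only the Python standard library with a brute-force neighbor search, and it assumes the background distribution can be approximated by sampling random index pairs. The pooled permutation step also ignores dependence among neighbor pairs, so the resulting p-value is only a rough guide.

```python
import random
from bisect import bisect_right


def knn_index_distances(features, k):
    """Return |i - j| for every point i and each of its k nearest neighbors j."""
    n = len(features)
    index_gaps = []
    for i in range(n):
        # Brute-force squared Euclidean distance from point i to every other point.
        by_distance = sorted(
            (sum((a - b) ** 2 for a, b in zip(features[i], features[j])), j)
            for j in range(n)
            if j != i
        )
        index_gaps.extend(abs(i - j) for _, j in by_distance[:k])
    return index_gaps


def random_pair_index_distances(n, num_pairs, rng):
    """Return |i - j| for randomly chosen pairs of distinct indices."""
    pairs = (rng.sample(range(n), 2) for _ in range(num_pairs))
    return [abs(i - j) for i, j in pairs]


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def non_iid_p_value(features, k=5, num_permutations=200, seed=0):
    """Permutation p-value for 'neighbors in feature space are also close in index'."""
    rng = random.Random(seed)
    neighbor = knn_index_distances(features, k)
    background = random_pair_index_distances(len(features), len(neighbor), rng)
    observed = ks_statistic(neighbor, background)

    # Shuffle the pooled samples to see how large the statistic gets by chance.
    pooled = neighbor + background
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        permuted = ks_statistic(pooled[: len(neighbor)], pooled[len(neighbor):])
        if permuted >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (num_permutations + 1)


# Synthetic example (assumed data): a single feature that drifts with position.
data_rng = random.Random(1)
drifting = [[i / 200 + data_rng.gauss(0, 0.01)] for i in range(200)]
p_drift = non_iid_p_value(drifting)
print(f"p-value for drifting data: {p_drift:.4f}")
```

When the feature drifts with position, neighbors in feature space are also close in the dataset ordering, so the neighbor index distances are much smaller than those of random pairs and the p-value comes out small. For large datasets, an approximate nearest-neighbor index would replace the quadratic search used here.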