Company
Date Published
Author
Sanjana Garg, Ulyana Tkachenko, Yiming Chen, Elías Snorrason, Jonas Mueller
Word count
1729
Language
English
Hacker News points
4

Summary

CleanVision is an open-source Python library that scans image datasets for common real-world issues: images that are blurry, under- or over-exposed, oddly sized, or (near) duplicates of others. Such images can significantly degrade the performance of machine learning models trained on these datasets. Issues detected in popular computer vision datasets like Caltech-256, Food101, CUB-200-2011, and CIFAR-10 include grayscale images, low information content, blurry images, near duplicates, odd aspect ratios, and mislabeled images. These issues can keep you from training the best possible model on your data and can lead to noise or spurious correlations in the model's outputs. CleanVision offers a systematic approach for detecting these issues using just a few lines of code and can be used to audit most image datasets on a CPU. The library has been successfully tested on multiple well-known image datasets; of those evaluated here, CIFAR-10 had the fewest issues. By filtering out bad data, CleanVision helps improve the quality of datasets for various computer vision tasks.
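
To make the kinds of checks described above more concrete, here is a minimal sketch of two of them, exact duplicates and very dark images, written with only the Python standard library. It is not CleanVision's implementation and does not use its API: the function names, the name-to-bytes input format, and the 0.2 brightness threshold are all assumptions chosen for illustration. Near duplicates would need a perceptual comparison that this byte-level hash does not attempt.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(images):
    """Group image names whose raw bytes are identical.

    `images` maps a file name to that file's bytes (hypothetical input
    format, chosen so the sketch needs no image-decoding library).
    """
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only the hashes shared by more than one image.
    return [sorted(names) for names in groups.values() if len(names) > 1]


def is_dark(pixels, threshold=0.2):
    """Flag an image whose mean grayscale brightness is low.

    `pixels` is a flat list of 0-255 grayscale values; `threshold` is an
    illustrative cutoff on the mean after scaling to the 0-1 range.
    """
    mean = sum(pixels) / len(pixels)
    return mean / 255 < threshold


# Toy inputs standing in for real image files.
images = {
    "a.png": b"\x00\x01\x02",
    "b.png": b"\x00\x01\x02",  # byte-for-byte copy of a.png
    "c.png": b"\xff\xfe\xfd",
}
duplicates = find_exact_duplicates(images)
dark_flag = is_dark([5, 10, 20, 15])
bright_flag = is_dark([200, 220, 240, 210])
```

In this toy run, `duplicates` groups `a.png` with `b.png`, and only the first pixel list is flagged as dark. A real audit would decode each image and run many such checks across the whole dataset, which is the work the library automates.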