Automatically catching spurious correlations in ML datasets
Blog post from Cleanlab
Version 2.7.0 of the open-source cleanlab package adds a new capability to its Datalab module: automatic detection of spurious correlations in datasets. A spurious correlation is a pattern that happens to track the label in the training data but is irrelevant to the actual task, so it can mislead machine learning models and degrade their performance. For example, if the images of one class happen to be darker than the rest, a model can latch onto darkness, a feature that does not generalize, and then predict poorly on real-world data. Datalab can identify more than eight types of issues, such as odd image sizes and grayscale images, and addressing them helps improve model accuracy and robustness. Two scenarios illustrate the impact of spurious correlations: in one, the dataset included darkened images of chicken wings, which misled the model; in the other, the dataset presented genuine features, and the model achieved higher accuracy. By flagging these correlations, Datalab helps ensure that models learn meaningful patterns, which contributes to more reliable and trustworthy AI systems.
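To make the darkness example concrete, here is a minimal sketch of the underlying idea, not Datalab's actual implementation: measure how well a single image property (here, mean brightness) predicts the class label on its own. The data is synthetic, and every name in it (`brightness`, `best_threshold_accuracy`, `make_dataset`) is a hypothetical helper invented for illustration.

```python
import random
from statistics import mean


def brightness(image):
    """Mean pixel intensity of a flattened image with values in [0, 1]."""
    return mean(image)


def best_threshold_accuracy(values, labels):
    """Best accuracy any single cutoff on `values` reaches for binary labels.

    Assumes the values are distinct, which holds for the random data below.
    """
    ordered = [label for _, label in sorted(zip(values, labels))]
    n = len(ordered)
    # Start with the cutoff below every value: everything is predicted as 1.
    correct = sum(ordered)
    best = max(correct, n - correct)
    for label in ordered:
        # Moving the cutoff past one item flips its prediction from 1 to 0.
        correct += 1 if label == 0 else -1
        # n - correct covers the mirrored rule (below cutoff -> 1).
        best = max(best, correct, n - correct)
    return best / n


def make_dataset(n, darkened_class, rng):
    """Synthetic 8x8 'images' for two balanced classes (hypothetical data).

    If `darkened_class` is not None, every image of that class is dimmed,
    mimicking the darkened chicken-wing photos.
    """
    images, labels = [], []
    for i in range(n):
        label = i % 2
        pixels = [rng.uniform(0.3, 1.0) for _ in range(64)]
        if label == darkened_class:
            pixels = [p * 0.3 for p in pixels]
        images.append(pixels)
        labels.append(label)
    return images, labels


rng = random.Random(0)
results = {}
for name, darkened_class in [("spurious", 1), ("clean", None)]:
    images, labels = make_dataset(200, darkened_class, rng)
    scores = [brightness(img) for img in images]
    results[name] = best_threshold_accuracy(scores, labels)
    print(f"{name}: brightness alone predicts the label with accuracy {results[name]:.2f}")
```

For two balanced classes, an accuracy near 0.5 means brightness carries little information about the label, while an accuracy near 1.0 means a model could classify the training set from brightness alone, which is the warning sign. The scoring that Datalab uses internally may differ from this simple threshold rule.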