Company
Date Published
Author
Johnson Kuan, Jonas Mueller, Anish Athalye
Word count
2173
Language
English
Hacker News points
None

Summary

In June 2021, Forbes published an article on the movement towards Data-Centric AI, which rests on the insight that improving the data, rather than the model, is often the more effective way to improve the overall performance of an AI system. This makes intuitive sense: the quality of a Machine Learning (ML) model depends on the quality of the data used to train and evaluate it ("garbage in, garbage out"). Given the abundance of capable open-source ML modeling packages, the modeling side is more or less a solved problem for many business applications. The key remaining challenge is making Data-Centric AI an efficient and systematic process, which calls for new tools focused on data quality. One such tool is cleanlab, which uses an algorithm called “Confident Learning” to automatically find label issues in a labeled dataset. It has been used to uncover thousands of label errors in ten widely used ML benchmark datasets, showing that data quality matters even in well-studied datasets. The article demonstrates how to use cleanlab to find label issues in audio datasets used for supervised learning, using the Spoken Digit dataset as an example. With cleanlab, developers can identify and fix label errors in their own datasets so that their ML models are trained on higher-quality data.
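To make the idea of finding label issues from model predictions more concrete, here is a minimal pure-Python sketch of the intuition behind Confident Learning. It is a simplified illustration under stated assumptions, not cleanlab's actual implementation: the function name `find_label_issues_sketch` and the toy data are hypothetical, and the predicted probabilities are assumed to come from a classifier evaluated out-of-sample (for example via cross-validation).

```python
def find_label_issues_sketch(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    labels:     list of integer class ids (the given, possibly noisy labels)
    pred_probs: list of per-class probability lists, one per example
                (assumed to be out-of-sample model predictions)
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the average probability the model assigns to
    # class k over the examples that are labeled k.
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    thresholds = [s / c if c else 1.0 for s, c in zip(sums, counts)]

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model predicts at or above their own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        likely = max(confident, key=lambda k: p[k])
        if likely != y:
            issues.append(i)  # confident prediction disagrees with the label

    # Rank flagged examples so the least plausible given labels come first.
    issues.sort(key=lambda i: pred_probs[i][labels[i]])
    return issues


# Toy data (hypothetical): example 2 is labeled 0 but strongly predicted as 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
issues = find_label_issues_sketch(labels, pred_probs)
print(issues)  # [2]
```

In practice, the cleanlab library wraps this kind of logic (with additional refinements) behind its own API, so a hand-rolled version like this is mainly useful for understanding why out-of-sample predicted probabilities are the key input.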