How to Reduce Dataset Size Without Losing Accuracy
Blog post from Roboflow
Data management plays a crucial role in training computer vision models. In an experiment on a personal protective equipment (PPE) dataset of over 39,000 images, removing outliers, duplicates, and blurry images shrank the dataset by 26%, while a YOLOv8 object detection model trained on the cleaned data reached a mean average precision (mAP) of 76.5%, only slightly below the original 79%.

The result suggests that efficiency gains can outweigh a minor drop in model performance. Tools such as Roboflow and Fastdup can identify and remove suboptimal images, keeping data quality high while cutting the computational resources and cost of training.

Whether the trade-off is worthwhile depends on the requirements of each project. For faster prototyping or lower training costs, a small accuracy loss may be acceptable; projects that demand absolute top performance may prefer to keep the full dataset.
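The post relied on Fastdup for these checks, but the two simplest ones, flagging blurry images and finding exact duplicates, can be sketched in plain Python. The functions below (`laplacian_variance`, `exact_duplicates`) are illustrative stand-ins written for this summary, not the tools the experiment actually used; variance of the Laplacian is a common blur heuristic, and byte-level hashing only catches identical files, not near-duplicates.

```python
import hashlib
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Variance of the Laplacian of a grayscale image.

    Low values mean little edge detail, a common heuristic for blur.
    """
    k = np.array([[0,  1, 0],
                  [1, -4, 1],
                  [0,  1, 0]], dtype=float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    # Manual 3x3 convolution (valid region only) to stay dependency-free.
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:i + h - 2, j:j + w - 2]
    return float(out.var())

def exact_duplicates(images: dict[str, bytes]) -> list[tuple[str, str]]:
    """Group byte-identical files by hash; returns (kept, duplicate) pairs."""
    seen: dict[str, str] = {}
    dupes: list[tuple[str, str]] = []
    for name, data in images.items():
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            dupes.append((seen[digest], name))
        else:
            seen[digest] = name
    return dupes

rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(float)  # noisy image: many edges
blurry = np.full((64, 64), 128.0)                          # flat image: no edges

print(laplacian_variance(sharp) > laplacian_variance(blurry))  # True
print(exact_duplicates({"a.jpg": b"\x01", "b.jpg": b"\x01", "c.jpg": b"\x02"}))
```

In practice you would rank a whole directory of images by `laplacian_variance` and drop the lowest tail, then remove the duplicate side of each pair before training; Fastdup additionally clusters near-duplicates and outliers by visual similarity, which simple hashing cannot do.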