How to Reduce Dataset Size Without Losing Accuracy

Blog post from Roboflow

Post Details
Company: Roboflow
Author: Arty Ariuntuya
Word Count: 1,759
Language: English
Summary

Data management plays a crucial role in training computer vision models. In this experiment, a dataset of over 39,000 images was reduced by 26% by removing outliers, duplicates, and blurry images, while model accuracy stayed close to the baseline. Using a personal protective equipment dataset and the YOLOv8 object detection model, the cleaned dataset reached a mean average precision (mAP) of 76.5%, compared with 79% for the original, a drop of 2.5 percentage points. The result suggests that efficiency gains can outweigh a small loss in model performance. The approach relies on tools such as Roboflow and fastdup to identify and remove suboptimal data, which reduces computational resources and costs. Whether the trade-off is worthwhile depends on each project's requirements: it is most attractive where faster prototyping or lower cost matters more than absolute top performance.
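
To make the cleaning step concrete, here is a minimal sketch of two of the checks the summary mentions: duplicate detection and blur detection. It is not the method used in the post and does not call Roboflow or fastdup. It assumes images are already available as raw bytes (for exact-duplicate hashing) and as 2D grayscale pixel grids (for a variance-of-Laplacian blur score, a common heuristic). The function names and the toy data are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(images: Dict[str, bytes]) -> List[List[str]]:
    """Group image names whose raw bytes are identical (exact duplicates only)."""
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only hashes shared by more than one file.
    return [sorted(names) for names in groups.values() if len(names) > 1]


def laplacian_variance(gray: List[List[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values mean few sharp edges, which often indicates a blurry image.
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# Toy data (hypothetical): two identical files and one distinct file.
files = {"a.jpg": b"same-bytes", "b.jpg": b"same-bytes", "c.jpg": b"other"}
duplicate_groups = find_exact_duplicates(files)

# A flat grey image has no edges; a checkerboard has many.
flat = [[128.0] * 5 for _ in range(5)]
checker = [[255.0 if (x + y) % 2 else 0.0 for x in range(5)] for y in range(5)]
flat_score = laplacian_variance(flat)
checker_score = laplacian_variance(checker)
```

In practice, near-duplicates and outliers need embedding-based similarity rather than byte hashes, and any blur threshold would have to be tuned per dataset. Those choices are outside what the summary specifies.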