How to Reduce Dataset Size Without Losing Accuracy
Blog post from Roboflow
Data management plays a crucial role in training computer vision models. In an experiment on a personal protective equipment (PPE) dataset of over 39,000 images, removing outliers, duplicates, and blurry images shrank the dataset by 26%, while a YOLOv8 object detection model trained on the cleaned data reached a mean average precision (mAP) of 76.5%, only slightly below the original 79%.

The result suggests that efficiency gains can outweigh a minor drop in model performance. Tools such as Roboflow and Fastdup can identify and remove suboptimal images, keeping data quality high while cutting the computational resources and cost of training.

Whether the trade-off is worthwhile depends on the requirements of each project. For faster prototyping or lower training costs, a small accuracy loss may be acceptable; projects that demand absolute top performance may prefer to keep the full dataset.
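The post relied on Fastdup for these checks, but the two simplest ones, flagging blurry images and finding exact duplicates, can be sketched in plain Python. The functions below (`laplacian_variance`, `exact_duplicates`) are illustrative stand-ins written for this summary, not the tools the experiment actually used; variance of the Laplacian is a common blur heuristic, and byte-level hashing only catches identical files, not near-duplicates.

```python
import hashlib
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Variance of the Laplacian of a grayscale image.

    Low values mean little edge detail, a common heuristic for blur.
    """
    k = np.array([[0,  1, 0],
                  [1, -4, 1],
                  [0,  1, 0]], dtype=float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    # Manual 3x3 convolution (valid region only) to stay dependency-free.
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:i + h - 2, j:j + w - 2]
    return float(out.var())

def exact_duplicates(images: dict[str, bytes]) -> list[tuple[str, str]]:
    """Group byte-identical files by hash; returns (kept, duplicate) pairs."""
    seen: dict[str, str] = {}
    dupes: list[tuple[str, str]] = []
    for name, data in images.items():
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            dupes.append((seen[digest], name))
        else:
            seen[digest] = name
    return dupes

rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(float)  # noisy image: many edges
blurry = np.full((64, 64), 128.0)                          # flat image: no edges

print(laplacian_variance(sharp) > laplacian_variance(blurry))  # True
print(exact_duplicates({"a.jpg": b"\x01", "b.jpg": b"\x01", "c.jpg": b"\x02"}))
```

In practice you would rank a whole directory of images by `laplacian_variance` and drop the lowest tail, then remove the duplicate side of each pair before training; Fastdup additionally clusters near-duplicates and outliers by visual similarity, which simple hashing cannot do.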