Training Qwen3 VL to label bounding boxes: synthetic data, environment and training analysis
Blog post from Hugging Face
In this article, UlrickBLE describes a synthetic data pipeline and a reinforcement learning (RL) environment built to improve small vision-language models (VLMs) at bounding box annotation, with a focus on detecting windows in architectural images. The scenes are generated procedurally with Three.js, which yields high-quality, auto-labeled datasets and avoids the difficulty of sourcing and manually labeling real-world data. Because the labels are produced together with the scenes, the author reports 100% precision for the generated bounding boxes, an advantage over manual labeling. The Qwen3 VL 2B Instruct model is then trained in a reusable RL environment to make its detections more precise, targeting two failure modes in particular: miscounting the occurrences of an object and missing the target area. Two reward functions are compared: strict IoU, and smooth geometry combined with IoU. The article covers the procedural generation of architectural data, the design of the RL environment, and an analysis of the training results obtained with these methods.
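To make the two reward styles concrete, here is a minimal Python sketch. It is not the author's implementation: the summary does not give the exact formulas, so the function names, the `(x_min, y_min, x_max, y_max)` box format, the center-distance term, and the 0.5 blend weight are all assumptions chosen for illustration. "Strict" is taken here to mean the raw IoU, which is zero whenever the boxes do not overlap; the "smooth" variant adds a geometric closeness term so that a near miss still scores higher than a far miss.

```python
import math
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def strict_iou_reward(pred: Box, target: Box) -> float:
    """Hypothetical strict reward: the raw IoU, zero without overlap."""
    return iou(pred, target)


def smooth_geometry_reward(
    pred: Box, target: Box, image_w: float, image_h: float, weight: float = 0.5
) -> float:
    """Hypothetical smooth reward: blend of IoU and center closeness.

    Closeness is 1 when the box centers coincide and falls linearly with
    the center distance, normalized by the image diagonal.
    """
    pred_cx, pred_cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tgt_cx, tgt_cy = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    dist = math.hypot(pred_cx - tgt_cx, pred_cy - tgt_cy)
    closeness = max(0.0, 1.0 - dist / math.hypot(image_w, image_h))
    return weight * iou(pred, target) + (1.0 - weight) * closeness
```

With these definitions, a prediction that does not touch the target gets a strict reward of zero regardless of how far off it is, while the smooth reward still increases as the predicted box moves toward the target. That kind of graded signal is one plausible motivation for pairing a geometry term with IoU.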