Training Design for Text-to-Image Models: Lessons from Ablations
A blog post from Hugging Face
In this second part of a series on training efficient text-to-image models, the authors focus on improving training speed, convergence reliability, and learning quality, documenting their experiments as a logbook of ablations.

The baseline, PRX-1.2B trained in a standard setup without shortcuts, serves as the reference point for evaluating each technique. Representation Alignment (REPA) speeds early convergence when enabled at the start of training and switched off later. Contrastive Flow Matching and the JiT approach are also explored, with JiT proving beneficial for high-resolution image training without a VAE. Token-routing methods such as TREAD and SPRINT deliver significant throughput gains, especially at higher resolutions, while data choices, such as long captions and synthetic images, shape both the training trajectory and the final results.

On the practical side, the authors highlight the Muon optimizer and the importance of not storing weights in bfloat16. They plan to release the full training recipe and run a public speedrun combining these methods, inviting community participation and feedback.
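To make the REPA idea concrete, here is a minimal sketch of what an alignment-loss term and its early-training schedule could look like. This is an illustrative NumPy reconstruction, not the authors' code: the function names, the projection matrix, the `repa_off_step` cutoff, and the `lam` weight are all assumptions.

```python
import numpy as np

def repa_alignment_loss(dit_features, encoder_features, proj):
    """Cosine-similarity alignment between a diffusion transformer block's
    hidden states and features from a frozen pretrained encoder.
    All shapes/names are illustrative, not from the post."""
    h = dit_features @ proj                                   # project to encoder dim
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    z = encoder_features / np.linalg.norm(encoder_features, axis=-1, keepdims=True)
    return -np.mean(np.sum(h * z, axis=-1))                   # maximize cosine similarity

def total_loss(diffusion_loss, align_loss, step,
               repa_off_step=50_000, lam=0.5):
    """REPA is active only early in training, then switched off
    (the cutoff step and weight here are placeholder values)."""
    if step < repa_off_step:
        return diffusion_loss + lam * align_loss
    return diffusion_loss
```

The key point the post makes is the schedule: the alignment term buys faster early convergence, and disabling it later avoids constraining the model once training is underway.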
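The token-routing methods mentioned (TREAD, SPRINT) gain throughput by processing only a subset of tokens through the expensive transformer blocks. A heavily simplified sketch of that routing pattern, with hypothetical helper names and a fixed keep ratio chosen for illustration:

```python
import numpy as np

def route_tokens(tokens, keep_ratio=0.5, rng=None):
    """Randomly select a subset of tokens to pass through the costly
    middle blocks; the rest bypass them. Simplified illustration of
    TREAD/SPRINT-style routing, not the papers' exact schemes."""
    rng = rng or np.random.default_rng(0)
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    idx = rng.choice(n, size=k, replace=False)
    return idx, tokens[idx]

def merge_tokens(tokens, idx, processed):
    """Scatter the processed subset back into the full token sequence."""
    out = tokens.copy()
    out[idx] = processed
    return out
```

Because attention and MLP cost scale with sequence length, halving the routed tokens roughly halves the cost of the bypassed blocks, which is why the gains grow at higher resolutions where sequences are longest.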
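The warning against storing weights in bfloat16 comes down to update precision: small optimizer steps can fall below the rounding granularity of a low-precision format and silently vanish. A minimal demonstration using NumPy's float16 as a stand-in (NumPy has no native bfloat16; bfloat16 has even fewer mantissa bits, so the effect there is stronger):

```python
import numpy as np

# Keep master weights in float32; use low precision only for compute.
master = np.float32(1.0)
lowp = np.float16(1.0)
update = 1e-4  # a typical small optimizer step

for _ in range(100):
    master = np.float32(master + update)              # fp32 accumulates every step
    lowp = np.float16(lowp + np.float16(update))      # below fp16 rounding step: lost

# master has drifted to ~1.01, while lowp is still exactly 1.0
```

This is why mixed-precision setups conventionally keep an fp32 master copy of the weights and cast down only for the forward/backward passes.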