CRAFT: Continuous Reasoning and Agentic Feedback Tuning
Blog post from Hugging Face
CRAFT is a framework for text-to-image generation and image editing that improves compositional accuracy and text rendering by adding a reasoning loop: it decomposes each prompt into structured visual questions and verifies the generated image against them with a Visual Language Model (VLM). The method is model-agnostic and uses existing tools without retraining. It refines the prompt only where a constraint fails and iteratively edits the image until all constraints are satisfied. Evaluated across several models, including FLUX-Schnell and Qwen-Image, CRAFT improves visual constraint satisfaction and compositional consistency, with particularly strong results on the DSG-1K and Parti-Prompt datasets. The framework's effectiveness depends heavily on the accuracy of the VLM, and the loop adds some overhead, though this is small relative to the gains over traditional methods.
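The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CRAFT's actual code: the function names (`craft_loop`, `decompose`, `generate`, `verify`, `refine`, `edit`) and the `max_edits` cap are hypothetical, and the toy stand-ins at the bottom replace the real generator, editor, and VLM with simple set operations so the control flow can be run as is.

```python
from typing import Callable, List, Set, Tuple

Image = Set[str]  # toy stand-in: an "image" is the set of constraints it depicts


def craft_loop(
    prompt: str,
    decompose: Callable[[str], List[str]],         # prompt -> visual questions
    generate: Callable[[str], Image],              # prompt -> initial image
    verify: Callable[[Image, str], bool],          # VLM check for one question
    refine: Callable[[str, List[str]], List[str]], # failed checks -> edit instructions
    edit: Callable[[Image, List[str]], Image],     # apply targeted edits
    max_edits: int = 3,                            # assumed safety cap, not from the source
) -> Tuple[Image, List[str], int]:
    """Generate once, then edit only where verification fails."""
    questions = decompose(prompt)
    image = generate(prompt)
    failed = [q for q in questions if not verify(image, q)]
    edits = 0
    while failed and edits < max_edits:
        # Refinement targets only the constraints that failed.
        instructions = refine(prompt, failed)
        image = edit(image, instructions)
        edits += 1
        failed = [q for q in questions if not verify(image, q)]
    return image, failed, edits


# --- Toy components (all hypothetical, for illustration only) ---
def toy_decompose(prompt: str) -> List[str]:
    # Treat each " and "-separated clause as one constraint.
    return [part.strip() for part in prompt.split(" and ")]


def toy_generate(prompt: str) -> Image:
    # Pretend the generator only gets the first constraint right.
    return {toy_decompose(prompt)[0]}


def toy_verify(image: Image, question: str) -> bool:
    return question in image


def toy_refine(prompt: str, failed: List[str]) -> List[str]:
    return list(failed)


def toy_edit(image: Image, instructions: List[str]) -> Image:
    # Pretend the editor fixes one constraint per call.
    return image | {instructions[0]}


prompt = "a red cube and a blue sphere and the text OPEN"
image, failed, edits = craft_loop(
    prompt, toy_decompose, toy_generate, toy_verify, toy_refine, toy_edit
)
print(sorted(image), failed, edits)
```

Passing the components in as callables mirrors the model-agnostic claim: any generator, editor, or VLM can be swapped in without changing the loop. The returned `failed` list makes it explicit when the cap was reached before every constraint passed.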