Sub-second image generation with Flux.2 and Qwen-Image
Blog post from Baseten
Baseten has optimized image generation serving for the Flux.2 [dev] and Qwen-Image models, achieving speedups over SGLang of up to 2.3x and 1.6x, respectively, on NVIDIA Blackwell GPUs and up to 1.9x and 1.1x on NVIDIA Hopper GPUs. The gains reduce single-request latency, which matters most for latency-sensitive applications such as creative tools and marketing, and they also improve user experience, throughput, and cost efficiency. The optimizations combine hardware-aware quantization, memory improvements, and specialized kernels. With Baseten FP4 on B200 GPUs, Flux.2 [dev] latency drops below one second, and Qwen-Image also sees a significant speedup. The Baseten Inference Stack supports a range of image generation parameters for reliable, efficient production use, and the same approach can be adapted to other models for low-latency, high-reliability deployments.
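To make the speedup figures easier to read as latency, the short sketch below converts each quoted factor into the share of baseline latency removed. It rests on two assumptions not stated in the summary: that "speedup" means baseline latency divided by optimized latency, and that each pair of figures maps to Flux.2 [dev] and Qwen-Image in the order the models are listed. The figures are "up to" values, so the results are upper bounds, and the function and dictionary names are illustrative only.

```python
# Speedup factors quoted in the post, relative to SGLang.
# Assumption: each pair maps to (Flux.2 [dev], Qwen-Image) in the order listed.
SPEEDUPS = {
    ("Blackwell", "Flux.2 [dev]"): 2.3,
    ("Blackwell", "Qwen-Image"): 1.6,
    ("Hopper", "Flux.2 [dev]"): 1.9,
    ("Hopper", "Qwen-Image"): 1.1,
}


def latency_fraction(speedup: float) -> float:
    """Fraction of baseline latency that remains after a given speedup.

    Assumes speedup = baseline_latency / optimized_latency.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def latency_reduction_pct(speedup: float) -> float:
    """Percentage of baseline latency removed by a given speedup."""
    return (1.0 - latency_fraction(speedup)) * 100.0


if __name__ == "__main__":
    for (gpu, model), factor in SPEEDUPS.items():
        pct = latency_reduction_pct(factor)
        print(f"{gpu:<9} {model:<13} {factor:.1f}x -> up to {pct:.1f}% lower latency")
```

Read this way, a 1.1x speedup is a modest gain of roughly 9% in latency, while a 2.3x speedup removes more than half of the baseline latency.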