FireAttention V4 is an inference engine built around the FP4 precision format on NVIDIA B200 GPUs, where it has surpassed 250 tokens per second in independent benchmarks, leading in both latency and cost efficiency.

NVIDIA's Blackwell architecture adds hardware-native support for micro-scaled floating-point formats. Among these, NVFP4 has emerged as the most efficient, delivering higher throughput and lower memory demands than the MXFP4 and MXFP8 alternatives.

FireAttention V4 has been optimized end to end for FP4, showing substantial throughput gains over previous generations while remaining competitive on quality, particularly when evaluated against comprehensive benchmarks such as MMLU Pro. FP4 does introduce an inherent quality drop, but Quantization-Aware Training (QAT) mitigates it, allowing models to retain accuracy while benefiting from FP4 performance.

Enterprise customers can now access B200 deployments running FireAttention V4 with FP4, providing optimal latency and cost-effectiveness for demanding applications.
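To make the micro-scaling idea concrete, here is a minimal, illustrative sketch of block-wise FP4 quantization in the spirit of NVFP4. It assumes 16-element blocks with one scale per block and the E2M1 representable magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}; the block sizes, scale encoding, and rounding used by the actual hardware and by FireAttention V4 are not specified in this post, so treat this purely as a conceptual model.

```python
# Illustrative sketch of micro-scaled FP4 (NVFP4-style) quantization.
# Assumptions (not from the source): 16-element blocks, one float scale
# per block, E2M1 magnitude grid, round-to-nearest. Real hardware encodes
# scales compactly (e.g. in FP8) and packs two FP4 values per byte.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable |x|
BLOCK = 16

def quantize_block(values):
    """Quantize one block: choose a scale so the largest magnitude maps to 6,
    then snap each scaled value to the nearest representable FP4 magnitude."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    q = []
    for v in values:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        q.append(mag if v >= 0 else -mag)
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate values: one multiply per element."""
    return [scale * x for x in q]

if __name__ == "__main__":
    weights = [0.03, -0.12, 0.07, 0.50, -0.33, 0.01, 0.09, -0.24,
               0.18, 0.02, -0.05, 0.11, 0.44, -0.60, 0.29, 0.07]
    scale, q = quantize_block(weights)
    recon = dequantize_block(scale, q)
    err = max(abs(a - b) for a, b in zip(weights, recon))
    print(f"scale={scale:.4f}, max abs error={err:.4f}")
```

Because the scale is chosen per small block rather than per tensor, an outlier only inflates the quantization step for its own 16 elements, which is what keeps the quality drop small enough for QAT to recover.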