Company
Fireworks AI
Date Published
Author
-
Word count
891
Language
English
Hacker News points
None

Summary

FireAttention V2 significantly improves the performance of long-context large language models (LLMs), making them practical for online inference, particularly at context lengths from 8K to 32K tokens. The Fireworks team reports major improvements, including FP16 and FP8 prefill kernels and a multi-host deployment mode that benefits high-traffic applications. The post critiques existing long-context benchmarks, advocating more comprehensive tests that require reasoning beyond simple retrieval. Benchmark results show that the open-source Qwen 72B model handles long-context tasks well, as do proprietary models. FireAttention V2 delivers higher throughput and lower latency than vLLM, particularly in FP8 mode, in both short/medium-generation and long-generation scenarios. Multi-host mode amplifies these gains further, offering significant throughput and latency improvements for enterprise customers.
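
To give a sense of why FP8 prefill helps, here is a minimal NumPy sketch of the general scale-quantize-matmul-dequantize pattern behind FP8 (e4m3-style) attention-score computation. It is an illustration of the technique only, not Fireworks' actual kernels; the shapes, seed, and per-tensor scaling scheme are assumptions made for the example.

```python
import numpy as np

# Minimal sketch, not Fireworks' kernels: simulate e4m3-style FP8
# (3 mantissa bits, max finite magnitude 448) to show the basic
# scale -> quantize -> matmul -> dequantize pattern of an FP8 prefill.
FP8_MAX = 448.0
MANTISSA_BITS = 3

def simulate_fp8(x):
    """Per-tensor dynamic scaling into the e4m3 range, then mantissa rounding."""
    scale = float(np.abs(x).max()) / FP8_MAX
    y = np.clip(x / scale, -FP8_MAX, FP8_MAX)
    mant, exp = np.frexp(y)  # y = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 2 ** (MANTISSA_BITS + 1)) / 2 ** (MANTISSA_BITS + 1)
    return np.ldexp(mant, exp), scale  # real FP8 stores one byte per value

rng = np.random.default_rng(0)
seq_len, head_dim = 2048, 128  # an illustrative prefill shape
q = rng.standard_normal((seq_len, head_dim)).astype(np.float32)
k = rng.standard_normal((seq_len, head_dim)).astype(np.float32)

q8, q_scale = simulate_fp8(q)
k8, k_scale = simulate_fp8(k)

exact = q @ k.T                             # FP32 pre-softmax attention scores
approx = (q8 @ k8.T) * (q_scale * k_scale)  # dequantize after the matmul

rel_err = np.abs(approx - exact).mean() / np.abs(exact).mean()
print(f"mean relative error of simulated FP8 scores: {rel_err:.4f}")
```

The accuracy cost is small, while real FP8 halves the bytes moved per value relative to FP16, which is where much of the prefill speedup comes from.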
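
To make the benchmark critique concrete, below is a self-contained sketch of a long-context task that cannot be solved by retrieving a single "needle": the answer requires locating and summing many facts scattered through filler text. The scenario, names, and numbers are invented for illustration and are not the tests used in the post.

```python
import random

# Invented example, not the post's benchmark: a long-context question
# that requires aggregating 40 scattered facts, so single-fact
# ("needle in a haystack") retrieval is not enough to answer it.
random.seed(0)
FILLER = "The quarterly report reiterated previously published figures. "

facts, total = [], 0
for day in range(1, 41):
    crates = random.randint(1, 50)
    total += crates
    facts.append(f"On day {day}, warehouse B7 received {crates} crates. ")

chunks = []
for fact in facts:  # spread the facts across a long context
    chunks.append(FILLER * 30)
    chunks.append(fact)
context = "".join(chunks)

prompt = (
    context
    + "\n\nQuestion: How many crates did warehouse B7 receive in total "
    + "across all days? Answer with a single number."
)
print(f"approx. context length: {len(context) // 4} tokens (4 chars/token)")
print(f"expected answer: {total}")
```

A prompt built this way lands in the 8K-32K token range discussed above, and scoring is a simple exact match against the known sum.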