Optimizing Llama 4 Maverick on Fireworks AI

Post Details

Company

Fireworks AI

Date Published

Oct. 6, 2025

Author

-

Word Count

1,205

Language

English

Hacker News Points

-

Source URL

fireworks.ai/blog/llama4-maverick

Summary

Meta's Llama 4 Maverick, an innovative Mixture-of-Experts model, is designed to process both text and images with an expanded context window of 1 million tokens, enhancing its capability to manage extensive data like code repositories and detailed product specifications. Fireworks AI quickly incorporated this model, offering superior performance through optimizations such as FP8 quantization, tensor and expert parallelism, and a custom attention mechanism, allowing it to achieve a streaming throughput of 145 tokens per second, outperforming competitors. The platform supports an OpenAI-compatible function-calling interface, enabling users to execute deterministic function calls effectively. Fireworks AI’s infrastructure ensures efficient asynchronous GPU execution and is the first public API for Llama 4, providing the fastest and most expansive context model with native function calling for developers.