Faster Gemma 4 on MLX with multi-token prediction

Post Details

Company

Ollama

Date Published

June 29, 2026

Word Count

835

Source URL

ollama.com/blog/faster-gemma-4-mlx-mtp

Summary

Gemma 4 in Ollama 0.31 generates tokens nearly 90% faster on Apple Silicon thanks to multi-token prediction (MTP). A small, fast draft model proposes several tokens, and the main model verifies them in a single pass, so throughput rises while outputs stay unchanged. The draft model's proposals are often accepted in predictable contexts such as code, which makes coding agents more responsive as they repeatedly interact with files and tools. MTP is tuned automatically at runtime based on the workload, so speculative decoding does not slow generation down when it stops being advantageous. The approach uses the GPU for drafting, sampling, and verification together, and a new matrix multiplication kernel speeds up Gemma 4's computations on specific setups. The gains show up in real programming tasks, and the technique could be applied to models beyond Gemma 4.
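
The draft-then-verify loop described in the summary can be illustrated with a minimal sketch. This is not Ollama's or MLX's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the main model and the draft model, decoding is greedy, and the verification step is written as a Python loop where a real runtime would score all drafted positions in one batched forward pass. The sketch only shows why accepted drafts leave the output identical to decoding with the main model alone.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the main model: a deterministic greedy choice."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the draft model: usually agrees with the target."""
    guess = target_next(context)
    # Deliberately disagree on every fourth context length so rejections occur.
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(prompt: List[Token], n_new: int, model: NextTokenFn) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    prompt: List[Token],
    n_new: int,
    target: NextTokenFn,
    draft: NextTokenFn,
    k: int = 4,
) -> Tuple[List[Token], float]:
    """Greedy speculative decoding; returns the tokens and the acceptance rate."""
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = checked = 0
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each drafted token against its own choice.
        #    (Assumption: a real runtime does this in a single batched pass.)
        for tok in proposal:
            expected = target(out)
            checked += 1
            if tok == expected:
                accepted += 1
                out.append(tok)
            else:
                # First mismatch: keep the target's token, discard the rest.
                out.append(expected)
                break
    rate = accepted / checked if checked else 0.0
    return out, rate


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rate = speculative_decode(prompt, 20, target_next, draft_next, k=4)
    print(tokens == greedy_decode(prompt, 20, target_next), round(rate, 2))
```

Every iteration adds at least one token chosen by the target, so the result matches plain greedy decoding regardless of how good the draft is; the draft only affects how many tokens each verification round yields. A runtime that tracks the acceptance rate could, in principle, shrink `k` or stop drafting when that rate drops, which is one way to read the summary's point about tuning MTP to the workload.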
