Faster Gemma 4 on MLX with multi-token prediction

Blog post from Ollama

Post Details
Company: Ollama
Word Count: 835
Company Posts That Month: 4
Summary

Gemma 4 in Ollama 0.31 generates tokens nearly 90% faster on Apple Silicon thanks to multi-token prediction (MTP). A small, fast draft model proposes several tokens, and the main model verifies them in a single pass, which raises throughput without changing the output. Proposals are accepted most often in predictable contexts such as code, so coding agents that repeatedly read files and call tools become more responsive. The runtime tunes MTP automatically to the workload and backs off when speculative decoding stops being advantageous, so it does not slow generation down. Drafting, sampling, and verification run together on the GPU, and a new matrix multiplication kernel speeds up Gemma 4's computations on specific setups. The gains show up in real programming tasks, and the approach could extend to models beyond Gemma 4.
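To make the draft-and-verify idea concrete, here is a minimal Python sketch of greedy speculative decoding. It is an illustration under stated assumptions, not Ollama's or MLX's implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for the main and draft models, the verification loop stands in for what would be a single batched forward pass, and the rule that grows or shrinks the draft length `k` is an invented heuristic that only gestures at the runtime tuning the post describes.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # greedy next token for a context


def target_model(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the main model."""
    return (ctx[-1] * 3 + 1) % 7


def draft_model(ctx: List[Token]) -> Token:
    """Hypothetical draft: agrees with the target except after token 5."""
    nxt = (ctx[-1] * 3 + 1) % 7
    return (nxt + 1) % 7 if ctx[-1] == 5 else nxt


def greedy_generate(prompt: List[Token], n_new: int, model: Model) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_generate(
    prompt: List[Token],
    n_new: int,
    target: Model,
    draft: Model,
    k: int = 4,
    max_k: int = 8,
) -> Tuple[List[Token], int]:
    """Return (tokens, number of verification passes by the target)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft proposes k tokens, one at a time.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real system
        #    would score all positions in one batched forward pass.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            accepted.append(expected)
            if proposal[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target(out + proposal))

        # 3. Assumed heuristic: draft more after full acceptance, less otherwise.
        k = min(k + 1, max_k) if len(accepted) == k + 1 else max(1, k - 1)
        out.extend(accepted)
    return out[: len(prompt) + n_new], passes
```

Because every emitted token is the one the target would have chosen itself, the result matches plain greedy decoding exactly. The saving comes from needing fewer target passes than generated tokens whenever the draft guesses well.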