Company: -
Date Published: -
Author: -
Word count: 297
Language: -
Hacker News points: None

Summary

Ollama has introduced an enhanced model scheduling system that precisely measures a model's memory requirements before running it, replacing the previous estimation-based approach. This reduces out-of-memory crashes by preventing over-allocation, and it improves GPU utilization by placing more of the model in GPU memory, which in turn speeds up token generation and prompt processing. The new scheduler also performs better across multiple GPUs, including mismatched GPU configurations, and reports memory usage accurately, matching tools such as nvidia-smi. All models on Ollama's new engine use the improved memory management by default, with more models transitioning soon. Benchmarks on NVIDIA GeForce RTX 4090 GPUs show increased token generation and prompt evaluation speeds across supported models.
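The core idea of measurement-based scheduling can be illustrated with a small sketch. This is not Ollama's actual implementation (which is written in Go); the function name, layer-size inputs, and reserve parameter below are all hypothetical. It shows the general technique: sum exact per-layer memory requirements against the measured free GPU memory and only offload the layers that fit, rather than relying on an estimate.

```python
def plan_allocation(layer_sizes, gpu_free_bytes, reserve_bytes=0):
    """Greedily assign as many layers as fit on the GPU; the rest go to CPU.

    Hypothetical sketch of measurement-based scheduling: instead of
    estimating total usage, the scheduler sums exact per-layer sizes and
    stops before exceeding the measured free GPU memory (minus a reserve),
    avoiding over-allocation and out-of-memory crashes.
    """
    budget = gpu_free_bytes - reserve_bytes
    gpu_layers, cpu_layers, used = [], [], 0
    for i, size in enumerate(layer_sizes):
        if used + size <= budget:
            gpu_layers.append(i)   # fits within the measured budget
            used += size
        else:
            cpu_layers.append(i)   # spill this layer to system memory
    return gpu_layers, cpu_layers, used

# Example: four layers of 2 GiB each, 7 GiB measured free on the GPU.
GiB = 1024 ** 3
gpu, cpu, used = plan_allocation([2 * GiB] * 4, 7 * GiB)
# Three layers fit on the GPU (6 GiB used); the fourth spills to CPU.
```

The same accounting that drives placement can then be reported directly, which is why utilization figures line up with what nvidia-smi shows.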