Company: -
Date Published: -
Author: -
Word count: 297
Language: -
Hacker News points: None

Summary

Ollama has introduced an enhanced model scheduling system that precisely measures a model's memory requirements before running it, replacing the previous estimation-based approach. This reduces out-of-memory crashes by preventing over-allocation, and it improves GPU utilization by placing more of the model in GPU memory, which in turn speeds up token generation and prompt processing. The new scheduler also performs better across multiple GPUs, including mismatched GPU configurations, and reports memory usage accurately, matching tools such as nvidia-smi. All models on Ollama's new engine use the improved memory management by default, with more models transitioning soon. Benchmarks on NVIDIA GeForce RTX 4090 GPUs show increased token generation and prompt evaluation speeds across supported models.
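The core idea of measurement-based scheduling can be illustrated with a small sketch. This is not Ollama's actual implementation (which is written in Go); the function name, layer-size inputs, and reserve parameter below are all hypothetical. It shows the general technique: sum exact per-layer memory requirements against the measured free GPU memory and only offload the layers that fit, rather than relying on an estimate.

```python
def plan_allocation(layer_sizes, gpu_free_bytes, reserve_bytes=0):
    """Greedily assign as many layers as fit on the GPU; the rest go to CPU.

    Hypothetical sketch of measurement-based scheduling: instead of
    estimating total usage, the scheduler sums exact per-layer sizes and
    stops before exceeding the measured free GPU memory (minus a reserve),
    avoiding over-allocation and out-of-memory crashes.
    """
    budget = gpu_free_bytes - reserve_bytes
    gpu_layers, cpu_layers, used = [], [], 0
    for i, size in enumerate(layer_sizes):
        if used + size <= budget:
            gpu_layers.append(i)   # fits within the measured budget
            used += size
        else:
            cpu_layers.append(i)   # spill this layer to system memory
    return gpu_layers, cpu_layers, used

# Example: four layers of 2 GiB each, 7 GiB measured free on the GPU.
GiB = 1024 ** 3
gpu, cpu, used = plan_allocation([2 * GiB] * 4, 7 * GiB)
# Three layers fit on the GPU (6 GiB used); the fourth spills to CPU.
```

The same accounting that drives placement can then be reported directly, which is why utilization figures line up with what nvidia-smi shows.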