New in llama.cpp: Model Management
Blog post from Hugging Face
llama.cpp has introduced a router mode in its server that enables dynamic model management without server restarts, a feature inspired by Ollama-style workflows. Router mode lets users load, unload, and switch between multiple models on the fly, and its multi-process architecture isolates each model so the others keep serving even if one crashes.

The server auto-discovers models from local caches or user-specified directories and loads them on demand, evicting the least recently used model when the resident limit (four by default) is reached. Clients select a model through the request's `model` field, and per-model configuration can be supplied via command-line options or presets. A web UI is also available for switching models interactively.

Together, these capabilities make it easier to run A/B tests, serve multi-tenant deployments, and swap models during development without restarting the server. The community response has been positive, with discussions of further improvements and integrations on GitHub.
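Because the router keys model selection off the request's `model` field, switching models from a client is just a matter of changing one string in an otherwise standard OpenAI-compatible chat request, which the llama.cpp server already accepts. The sketch below builds such payloads in Python; the model identifiers and the server URL in the comment are illustrative assumptions, not values taken from the post.

```python
import json

# Hypothetical model identifiers -- substitute whatever names the
# router reports for your locally available models.
MODEL_A = "llama-3.2-1b-instruct"
MODEL_B = "qwen2.5-0.5b-instruct"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload; the router reads the
    `model` field to decide which model should serve the request,
    loading it on demand if it is not already resident."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Two requests that differ only in the `model` field: the router
# handles loading and LRU eviction (four resident models by default).
req_a = build_chat_request(MODEL_A, "Summarize router mode in one line.")
req_b = build_chat_request(MODEL_B, "Summarize router mode in one line.")

# Each body would be POSTed to the server's chat endpoint, e.g.
# (URL/port are assumptions for a default local setup):
#   curl http://localhost:8080/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d '<json body>'
body_a = json.dumps(req_a)
body_b = json.dumps(req_b)
```

No restart, reconfiguration, or client-side connection change is needed between the two requests; only the `model` string differs.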