New in llama.cpp: Model Management
Blog post from Hugging Face
llama.cpp has introduced a router mode in its server that enables dynamic model management without server restarts, a feature inspired by Ollama-style workflows. Router mode lets users load, unload, and switch between multiple models on the fly, and its multi-process architecture isolates each model so the others keep serving even if one crashes.

The server auto-discovers models from local caches or user-specified directories and loads them on demand, evicting the least recently used model when the resident limit (four by default) is reached. Clients select a model through the request's `model` field, and per-model configuration can be supplied via command-line options or presets. A web UI is also available for switching models interactively.

Together, these capabilities make it easier to run A/B tests, serve multi-tenant deployments, and swap models during development without restarting the server. The community response has been positive, with discussions of further improvements and integrations on GitHub.
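Because the router keys model selection off the request's `model` field, switching models from a client is just a matter of changing one string in an otherwise standard OpenAI-compatible chat request, which the llama.cpp server already accepts. The sketch below builds such payloads in Python; the model identifiers and the server URL in the comment are illustrative assumptions, not values taken from the post.

```python
import json

# Hypothetical model identifiers -- substitute whatever names the
# router reports for your locally available models.
MODEL_A = "llama-3.2-1b-instruct"
MODEL_B = "qwen2.5-0.5b-instruct"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload; the router reads the
    `model` field to decide which model should serve the request,
    loading it on demand if it is not already resident."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Two requests that differ only in the `model` field: the router
# handles loading and LRU eviction (four resident models by default).
req_a = build_chat_request(MODEL_A, "Summarize router mode in one line.")
req_b = build_chat_request(MODEL_B, "Summarize router mode in one line.")

# Each body would be POSTed to the server's chat endpoint, e.g.
# (URL/port are assumptions for a default local setup):
#   curl http://localhost:8080/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d '<json body>'
body_a = json.dumps(req_a)
body_b = json.dumps(req_b)
```

No restart, reconfiguration, or client-side connection change is needed between the two requests; only the `model` string differs.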