
What is Multi-Model Serving and How Does it Transform your ML Infrastructure?

Blog post from Seldon

Post Details
Company
Date Published
Author
Seldon
Word Count
934
Language
English
Hacker News Points
-
Summary

Multi-model serving (MMS) runs multiple machine learning (ML) models on shared servers instead of giving each model its own, which reduces the infrastructure footprint and saves cost and energy. It is particularly efficient when combined with the "Overcommit" functionality, which lets a server host more models than fit in its memory: a least-recently-used (LRU) cache keeps active models in memory and moves less-used ones to disk. In a traditional single-model serving setup, each model is deployed in a separate container, which often leads to inefficient resource allocation, with overhead and costs growing as the number of models scales up. MMS addresses these issues by packing models onto shared servers, improving CPU/GPU sharing, and eliminating cold-start delays, which occur when a container image must be downloaded before a model can be deployed. Combining MMS with autoscaling and Overcommit enables intelligent resource management that accommodates fluctuating demand patterns and offers significant savings in infrastructure cost and energy consumption, which is critical in constrained environments such as edge device deployments.
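
The least-recently-used behaviour described for Overcommit can be illustrated with a small sketch. This is a toy model under stated assumptions, not Seldon's implementation: the class name `OvercommitCache`, its methods, and the model names and sizes are all hypothetical, and "disk" is simulated with a plain dictionary.

```python
from collections import OrderedDict


class OvercommitCache:
    """Toy LRU cache (hypothetical, for illustration only).

    The total size of registered models may exceed memory capacity;
    only the most recently used models are kept in memory.
    """

    def __init__(self, memory_capacity_mb):
        self.memory_capacity_mb = memory_capacity_mb
        self.in_memory = OrderedDict()  # name -> size_mb, least recently used first
        self.on_disk = {}               # name -> size_mb (simulated disk)

    def register(self, name, size_mb):
        """Register a model; it starts on disk until first requested."""
        if size_mb > self.memory_capacity_mb:
            raise ValueError(f"{name} can never fit in memory")
        self.on_disk[name] = size_mb

    def used_mb(self):
        return sum(self.in_memory.values())

    def infer(self, name):
        """Simulate a request; return 'hit' if the model was already in memory."""
        if name in self.in_memory:
            self.in_memory.move_to_end(name)  # mark as most recently used
            return "hit"
        size_mb = self.on_disk.pop(name)  # raises KeyError if never registered
        # Evict least-recently-used models to disk until the new one fits.
        while self.used_mb() + size_mb > self.memory_capacity_mb:
            evicted, evicted_size = self.in_memory.popitem(last=False)
            self.on_disk[evicted] = evicted_size
        self.in_memory[name] = size_mb
        return "miss"


# Three 40 MB models registered on a server with 100 MB of memory (overcommitted).
cache = OvercommitCache(memory_capacity_mb=100)
for model in ("a", "b", "c"):
    cache.register(model, 40)

results = [cache.infer(m) for m in ("a", "b", "a", "c")]
print(results)
print(list(cache.in_memory), list(cache.on_disk))
```

In this run, model `b` is the one moved to disk when `c` is loaded, because `a` was requested more recently. A real server would also need to account for load latency and concurrent requests, which this sketch ignores.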