LLM Serving Fairness: No more noisy neighbours
Blog post from Cohere
Cohere has developed "Serving Fairness" to address the "noisy neighbor" problem in multi-tenant SaaS platforms, where a burst of large language model inference requests from one tenant can increase latency for other users. The approach combines architectural patterns with a scheduling algorithm to distribute compute resources fairly among tenants, using rate limiting, performance tiering, and the Deficit Round Robin (DRR) algorithm. These mechanisms work in a structured flow: rate limiting controls request admission, performance tiering prioritizes requests according to service-level agreements, and DRR distributes compute equitably by granting each tenant a budget of requests or tokens, so that no single tenant can monopolize resources. Within each tenant, requests are further ordered by priority, deadline, and arrival time, which preserves both fairness and urgency. As a result, each tenant receives its fair share of resources and a balanced, predictable service experience, even during other tenants' traffic spikes. Serving Fairness is now available to all users of Cohere models through the Cohere SaaS API and third-party marketplaces such as AWS, and the company invites feedback for continuous improvement.
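To make the scheduling idea concrete, here is a minimal sketch of textbook Deficit Round Robin with per-tenant ordering by priority, deadline, and arrival time. It is not Cohere's implementation: the names `Request`, `DeficitRoundRobin`, `quantum`, `cost`, and `next_round` are hypothetical, the budget is assumed to be counted in tokens, and a lower `priority` number is assumed to mean more urgent.

```python
import heapq
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Ordering within a tenant (assumed): lower priority number first,
    # then earlier deadline, then earlier arrival.
    priority: int
    deadline: float
    arrival: float
    tenant: str = field(compare=False)
    cost: int = field(compare=False)  # hypothetical cost, e.g. estimated tokens


class DeficitRoundRobin:
    """Textbook DRR: each active tenant earns `quantum` credit per round."""

    def __init__(self, quantum: int):
        self.quantum = quantum
        self.queues = OrderedDict()  # tenant -> heap of pending requests
        self.deficit = {}            # tenant -> unused credit

    def submit(self, req: Request) -> None:
        heapq.heappush(self.queues.setdefault(req.tenant, []), req)
        self.deficit.setdefault(req.tenant, 0)

    def next_round(self) -> list:
        """Visit every active tenant once and return the requests served."""
        served = []
        for tenant in list(self.queues):
            heap = self.queues[tenant]
            self.deficit[tenant] += self.quantum
            # Serve requests while the head of the queue fits in the credit.
            while heap and heap[0].cost <= self.deficit[tenant]:
                req = heapq.heappop(heap)
                self.deficit[tenant] -= req.cost
                served.append(req)
            if not heap:
                # An idle tenant does not carry credit into later rounds.
                del self.queues[tenant]
                self.deficit[tenant] = 0
        return served


# Usage: tenant "A" sends a burst of five requests, tenant "B" sends one.
scheduler = DeficitRoundRobin(quantum=10)
for i in range(5):
    scheduler.submit(Request(1, 100.0, float(i), tenant="A", cost=10))
scheduler.submit(Request(1, 100.0, 5.0, tenant="B", cost=10))
first_round = [r.tenant for r in scheduler.next_round()]
print(first_round)  # ['A', 'B']
```

In this sketch, tenant "B" is served in the first round even though tenant "A" queued five requests ahead of it, which is the property that prevents a burst from one tenant from starving the others. A request whose cost exceeds the current credit waits, and the tenant's credit accumulates across rounds until the request fits.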
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| LLM | 3 | 5,172 | 1,006 | 220 | -43% |
| Real-time | 3 | 5,457 | 1,338 | 238 | -5% |
| Vector Search | 3 | 2,091 | 556 | 118 | -8% |