Content Deep Dive

LLM Serving Fairness: No more noisy neighbours

Blog post from Cohere

Post Details
Company
Date Published
Author
Blog
Word Count: 1,835
Company Posts That Month: 8
Language: English
Hacker News Points: -
Summary

Cohere has developed "Serving Fairness" to address the "noisy neighbor" problem in multi-tenant SaaS platforms, where an uneven burst of large language model inference requests from one tenant can increase latency for the others. The approach combines architectural patterns with a scheduling algorithm so that compute is shared fairly among tenants, using three mechanisms: rate limiting, performance tiering, and the Deficit Round Robin (DRR) algorithm. These mechanisms work in a structured flow: rate limiting controls which requests are admitted, performance tiering prioritizes requests according to service-level agreements, and DRR distributes compute by granting each tenant a budget of requests or tokens so that no single tenant can monopolize resources. Within each tenant, requests are further ordered by priority, deadline, and arrival time, which preserves urgency as well as fairness. The result is a balanced and predictable service in which each tenant receives its fair share of resources regardless of other tenants' traffic spikes. Serving Fairness is now available to all users of Cohere models through the SaaS API and third-party marketplaces such as AWS, and Cohere invites feedback for continued improvement.
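
To make the scheduling idea concrete, here is a minimal Python sketch of Deficit Round Robin with per-tenant ordering by priority, deadline, and arrival time. It is an illustration only, not Cohere's implementation: the names `Request`, `DeficitRoundRobin`, `submit`, and `next_round` are hypothetical, the budget is assumed to be measured in tokens, and a lower `priority` value is assumed to mean more urgent.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Only these three fields take part in ordering within a tenant.
    priority: int    # assumption: lower value = more urgent
    deadline: float
    arrival: float
    tenant: str = field(compare=False)
    tokens: int = field(compare=False)  # assumed cost of serving the request
    name: str = field(compare=False)


class DeficitRoundRobin:
    """Hypothetical DRR scheduler: each tenant earns `quantum` tokens per round."""

    def __init__(self, quantum: int) -> None:
        self.quantum = quantum
        self.queues: dict[str, list[Request]] = {}  # tenant -> heap of requests
        self.deficit: dict[str, int] = {}           # tenant -> unspent budget
        self.active: deque[str] = deque()           # tenants with pending work

    def submit(self, req: Request) -> None:
        heap = self.queues.setdefault(req.tenant, [])
        if not heap:
            # Tenant becomes active again and starts with an empty budget.
            self.active.append(req.tenant)
            self.deficit[req.tenant] = 0
        heapq.heappush(heap, req)

    def next_round(self) -> list[Request]:
        """Visit every active tenant once and return the requests served."""
        served = []
        for _ in range(len(self.active)):
            tenant = self.active.popleft()
            heap = self.queues[tenant]
            self.deficit[tenant] += self.quantum
            # Serve the most urgent requests while the budget covers them.
            while heap and heap[0].tokens <= self.deficit[tenant]:
                req = heapq.heappop(heap)
                self.deficit[tenant] -= req.tokens
                served.append(req)
            if heap:
                self.active.append(tenant)  # unspent budget carries over
            else:
                self.deficit[tenant] = 0    # idle tenants do not bank budget
        return served
```

In this sketch, a tenant that submits many requests still gets only one quantum per round, so a second tenant with a single request is served in the same round instead of waiting behind the burst. Unspent budget carries over only while a tenant has work queued, which lets a request larger than one quantum eventually run without letting idle tenants accumulate credit.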

Trends Found in this Post
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| LLM | 3 | 5,172 | 1,006 | 220 | -43% |
| Real-time | 3 | 5,457 | 1,338 | 238 | -5% |
| Vector Search | 3 | 2,091 | 556 | 118 | -8% |