Machine-learned model serving at scale
Blog post from Vespa
Serving machine-learned models at scale runs into trouble under concurrent requests, largely because default thread settings let every model evaluation spread across all available cores, causing resource contention and rising latency. Vespa.ai, which uses ONNX Runtime for accelerated model inference, sees this clearly with a model like BERT-base: performance degrades sharply as the number of concurrent requests grows. The fix is to change the threading model so that each model evaluation runs sequentially in its own thread, eliminating competition between intra-operation threads. On top of that, model distillation, for instance with XtremeDistilTransformers, retains most of the original model's accuracy while cutting computational cost and raising throughput. Even with these optimizations, latency figures should be read carefully: a number quoted for a single request on an idle machine says little about behavior under high concurrency, where real-world performance can fall far short of advertised capabilities.
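To make the threading change concrete, below is a minimal sketch using the onnxruntime Python API. Vespa embeds ONNX Runtime natively, so this is an illustration of the setting rather than Vespa's implementation; the model file name, input names, and shapes are assumptions for a typical BERT-base export and should be checked against `session.get_inputs()` for a real model.

```python
# Sketch: restrict intra-op parallelism so each evaluation runs sequentially,
# and let concurrency come from the request threads instead.
# Model path and input names below are assumptions for illustration.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1                       # one thread per evaluation
opts.inter_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Assumed model file; ONNX Runtime sessions are safe to call from many threads.
session = ort.InferenceSession("bert-base.onnx", sess_options=opts)

SEQ_LEN = 128
feed = {
    # Input names/dtypes assumed for a typical BERT ONNX export.
    "input_ids": np.zeros((1, SEQ_LEN), dtype=np.int64),
    "attention_mask": np.ones((1, SEQ_LEN), dtype=np.int64),
    "token_type_ids": np.zeros((1, SEQ_LEN), dtype=np.int64),
}

def evaluate() -> float:
    """Run one sequential model evaluation and return its latency in ms."""
    start = time.perf_counter()
    session.run(None, feed)
    return (time.perf_counter() - start) * 1000.0

# Measure latency at increasing concurrency: each worker thread runs its own
# single-threaded evaluation, so requests no longer fight over cores.
for concurrency in (1, 4, 16):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: evaluate(), range(64)))
    print(f"concurrency={concurrency:2d} "
          f"mean={np.mean(latencies):.1f}ms p95={np.percentile(latencies, 95):.1f}ms")
```

The trade-off this sketch illustrates is the one described above: capping intra-op threads raises the latency of a single isolated request, but keeps latency predictable and throughput high once many requests arrive at the same time.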