Machine-learned model serving at scale
Blog post from Vespa
Serving machine-learned models at scale runs into trouble under concurrent requests, largely because default thread settings let every model evaluation spread across all available cores, causing resource contention and rising latency. Vespa.ai, which uses ONNX Runtime for accelerated model inference, sees this clearly with a model like BERT-base: performance degrades sharply as the number of concurrent requests grows. The fix is to change the threading model so that each model evaluation runs sequentially in its own thread, eliminating competition between intra-operation threads. On top of that, model distillation, for instance with XtremeDistilTransformers, retains most of the original model's accuracy while cutting computational cost and raising throughput. Even with these optimizations, latency figures should be read carefully: a number quoted for a single request on an idle machine says little about behavior under high concurrency, where real-world performance can fall far short of advertised capabilities.
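To make the threading change concrete, below is a minimal sketch using the onnxruntime Python API. Vespa embeds ONNX Runtime natively, so this is an illustration of the setting rather than Vespa's implementation; the model file name, input names, and shapes are assumptions for a typical BERT-base export and should be checked against `session.get_inputs()` for a real model.

```python
# Sketch: restrict intra-op parallelism so each evaluation runs sequentially,
# and let concurrency come from the request threads instead.
# Model path and input names below are assumptions for illustration.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1                       # one thread per evaluation
opts.inter_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

# Assumed model file; ONNX Runtime sessions are safe to call from many threads.
session = ort.InferenceSession("bert-base.onnx", sess_options=opts)

SEQ_LEN = 128
feed = {
    # Input names/dtypes assumed for a typical BERT ONNX export.
    "input_ids": np.zeros((1, SEQ_LEN), dtype=np.int64),
    "attention_mask": np.ones((1, SEQ_LEN), dtype=np.int64),
    "token_type_ids": np.zeros((1, SEQ_LEN), dtype=np.int64),
}

def evaluate() -> float:
    """Run one sequential model evaluation and return its latency in ms."""
    start = time.perf_counter()
    session.run(None, feed)
    return (time.perf_counter() - start) * 1000.0

# Measure latency at increasing concurrency: each worker thread runs its own
# single-threaded evaluation, so requests no longer fight over cores.
for concurrency in (1, 4, 16):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: evaluate(), range(64)))
    print(f"concurrency={concurrency:2d} "
          f"mean={np.mean(latencies):.1f}ms p95={np.percentile(latencies, 95):.1f}ms")
```

The trade-off this sketch illustrates is the one described above: capping intra-op threads raises the latency of a single isolated request, but keeps latency predictable and throughput high once many requests arrive at the same time.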