
Stateful model serving: how we accelerate inference using ONNX Runtime

Blog post from Vespa

Post Details
Company: Vespa
Date Published: -
Author: Lester Solbakken
Word Count: 3,164
Language: English
Hacker News Points: -
Summary

Vespa.ai, an open-source platform for real-time data processing over large datasets, has integrated ONNX Runtime to enhance its capabilities in stateful model serving, particularly for applications requiring complex machine learning models. Unlike stateless model serving, stateful evaluation combines input data with stored information, making it suitable for tasks like search and recommendation. By deploying machine-learned models across its stateful content nodes, Vespa.ai evaluates models close to the data and avoids query-time data transportation costs.

The integration of ONNX Runtime has significantly boosted Vespa.ai's performance when evaluating large models, such as BERT and other Transformers, by leveraging hardware acceleration and model optimizations like quantization. Because ONNX is an interoperability standard, this lets Vespa.ai support models exported from a wide range of training frameworks without vendor lock-in.

Despite initial challenges in supporting complex models, ONNX Runtime's features, including multi-threading control and zero-copy tensor operations, have proven beneficial. Vespa.ai continues to explore further possibilities in ONNX Runtime, such as GPU support, to optimize its machine learning applications.
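To make the quantization optimization mentioned above concrete, here is a minimal sketch of affine (linear) int8 quantization, the general technique behind quantized Transformer inference: floats are mapped to 8-bit integers via a scale and zero point, trading a small amount of precision for faster, smaller models. The function names are illustrative only, not Vespa or ONNX Runtime APIs.

```python
# Sketch of affine uint8 quantization: x ≈ (q - zero_point) * scale.
# Illustrative only; real toolchains (e.g. ONNX Runtime's quantization
# tooling) handle per-channel scales, calibration, and operator fusion.

def quantize(values, num_bits=8):
    """Map floats to unsigned ints in [0, 2^num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale
    zero_point = round(qmin - lo / scale)
    quantized = [
        min(qmax, max(qmin, round(v / scale) + zero_point))
        for v in values
    ]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from quantized ints."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-1.5, -0.2, 0.0, 0.7, 1.5]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
```

Each restored value differs from the original by at most one quantization step (the scale), which is why well-calibrated int8 models typically lose little accuracy while gaining substantial throughput on hardware with fast integer arithmetic.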