Stateful model serving: how we accelerate inference using ONNX Runtime
Blog post from Vespa
Vespa.ai, an open-source platform for real-time computation over large datasets, has integrated ONNX Runtime to accelerate stateful model serving. Unlike stateless model serving, where each request carries all the data a model needs, stateful evaluation combines the input with data already stored in the platform, which suits applications such as search and recommendation. By deploying machine-learned models directly on its stateful content nodes, Vespa.ai evaluates models where the data lives and avoids the cost of transporting large volumes of data at query time.

Integrating ONNX Runtime has significantly improved Vespa.ai's performance when evaluating large models such as BERT and other Transformers, by leveraging hardware acceleration and model optimizations such as quantization. Because ONNX is an open interoperability standard, Vespa.ai can serve models exported from a wide range of training frameworks without vendor lock-in.

Although supporting complex models posed initial challenges, ONNX Runtime features such as fine-grained multi-threading control and zero-copy tensor operations have proven valuable in a serving context. Vespa.ai continues to explore further ONNX Runtime capabilities, such as GPU support, to optimize its machine learning applications.