
GPU-accelerated ML inference in Vespa Cloud

Blog post from Vespa

Post Details

Company: Vespa
Author: Martin Polden
Word Count: 581
Language: English

Summary

Vespa has introduced GPU-accelerated ONNX model inference in Vespa Cloud, offering better performance and cost efficiency than CPU instances. Users can configure GPU instances in AWS zones through the services.xml file, and Vespa Cloud provisions and configures the hardware automatically. GPU support is also available for open-source Vespa, though it requires additional container configuration. A benchmark using the CORD-19 application showed that GPU instances substantially reduce latency: the GPU instance achieved an average latency of 212 ms at 18.8 QPS, versus 1011 ms at 3.95 QPS for the CPU instance, roughly a 4.8x improvement in both latency and throughput, while also being 13% more cost-effective. This highlights the advantages of GPU acceleration for machine learning model inference in Vespa.