GPU-accelerated ML inference in Vespa Cloud
Blog post from Vespa
Vespa has introduced GPU-accelerated ONNX model inference in Vespa Cloud, offering lower latency and better cost efficiency than comparable CPU instances.

GPU instances are currently available in AWS zones and are requested through the application's services.xml file; Vespa Cloud provisions and configures the GPU nodes automatically (see the configuration sketch below). GPU support is also available in open-source Vespa, where the container running the models must be set up for GPU execution.

A benchmark using the CORD-19 application showed that GPU instances reduce latency substantially: the GPU instance served queries at an average latency of 212 ms with a throughput of 18.8 QPS, versus 1011 ms and 3.95 QPS for the CPU instance, while also being 13% more cost-effective. This highlights the advantages of GPU acceleration for machine learning model inference in Vespa.
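As a rough illustration of the Vespa Cloud setup, a GPU-enabled container cluster is requested by adding a gpu element to the node resources in services.xml. The sketch below is indicative only; the resource values (vcpu, memory, disk, GPU count and GPU memory) are assumptions and should be checked against the Vespa Cloud documentation for the chosen zone.

```xml
<!-- services.xml: sketch of a GPU-enabled stateless container cluster in Vespa Cloud.
     Resource numbers are illustrative assumptions, not recommendations. -->
<container id="default" version="1.0">
    <document-api/>
    <search/>
    <nodes count="1">
        <resources vcpu="4" memory="16Gb" disk="125Gb">
            <!-- Request one GPU with 16Gb of memory on each container node -->
            <gpu count="1" memory="16Gb"/>
        </resources>
    </nodes>
</container>
```

With a configuration along these lines, Vespa Cloud handles provisioning of the GPU nodes and the runtime setup needed for ONNX inference, as described in the post.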
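For open-source Vespa, the post notes that GPU use requires container-level setup, for example exposing the GPU to the Vespa container runtime. Once that is in place, an ONNX model evaluated in the stateless container can be pointed at a GPU device. The sketch below follows Vespa's stateless model-evaluation configuration as an assumption; the model name "reranker" is a hypothetical placeholder, and the exact elements supported should be verified against the documentation for your Vespa version.

```xml
<!-- services.xml: sketch of directing a stateless ONNX model onto GPU device 0.
     Model name is hypothetical; element names assumed from Vespa's
     stateless model-evaluation configuration. -->
<container id="default" version="1.0">
    <model-evaluation>
        <onnx>
            <models>
                <model name="reranker">
                    <!-- Evaluate this model on the first GPU if one is available -->
                    <gpu-device>0</gpu-device>
                </model>
            </models>
        </onnx>
    </model-evaluation>
    <nodes count="1"/>
</container>
```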