GPU-accelerated ML inference in Vespa Cloud
Blog post from Vespa
Vespa has introduced GPU-accelerated ONNX model inference in Vespa Cloud, offering lower latency and better cost efficiency than comparable CPU instances.

GPU instances are currently available in AWS zones and are requested through the application's services.xml file; Vespa Cloud provisions and configures the GPU nodes automatically (see the configuration sketch below). GPU support is also available in open-source Vespa, where the container running the models must be set up for GPU execution.

A benchmark using the CORD-19 application showed that GPU instances reduce latency substantially: the GPU instance served queries at an average latency of 212 ms with a throughput of 18.8 QPS, versus 1011 ms and 3.95 QPS for the CPU instance, while also being 13% more cost-effective. This highlights the advantages of GPU acceleration for machine learning model inference in Vespa.
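As a rough illustration of the Vespa Cloud setup, a GPU-enabled container cluster is requested by adding a gpu element to the node resources in services.xml. The sketch below is indicative only; the resource values (vcpu, memory, disk, GPU count and GPU memory) are assumptions and should be checked against the Vespa Cloud documentation for the chosen zone.

```xml
<!-- services.xml: sketch of a GPU-enabled stateless container cluster in Vespa Cloud.
     Resource numbers are illustrative assumptions, not recommendations. -->
<container id="default" version="1.0">
    <document-api/>
    <search/>
    <nodes count="1">
        <resources vcpu="4" memory="16Gb" disk="125Gb">
            <!-- Request one GPU with 16Gb of memory on each container node -->
            <gpu count="1" memory="16Gb"/>
        </resources>
    </nodes>
</container>
```

With a configuration along these lines, Vespa Cloud handles provisioning of the GPU nodes and the runtime setup needed for ONNX inference, as described in the post.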
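For open-source Vespa, the post notes that GPU use requires container-level setup, for example exposing the GPU to the Vespa container runtime. Once that is in place, an ONNX model evaluated in the stateless container can be pointed at a GPU device. The sketch below follows Vespa's stateless model-evaluation configuration as an assumption; the model name "reranker" is a hypothetical placeholder, and the exact elements supported should be verified against the documentation for your Vespa version.

```xml
<!-- services.xml: sketch of directing a stateless ONNX model onto GPU device 0.
     Model name is hypothetical; element names assumed from Vespa's
     stateless model-evaluation configuration. -->
<container id="default" version="1.0">
    <model-evaluation>
        <onnx>
            <models>
                <model name="reranker">
                    <!-- Evaluate this model on the first GPU if one is available -->
                    <gpu-device>0</gpu-device>
                </model>
            </models>
        </onnx>
    </model-evaluation>
    <nodes count="1"/>
</container>
```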