Company
Date Published
Author
Adam Walford
Word count
1277
Language
English
Hacker News points
None

Summary

Speechmatics has moved its transcription models onto GPUs, which significantly improves accuracy but also increases cost. To keep processing efficient and cost-effective, they use the Real Time Factor (RTF) — the ratio of processing time to audio duration — as a performance guideline. Running on GPU hardware introduces challenges such as shared resources and unpredictable traffic demand. To address this, Speechmatics uses Kubernetes Event-driven Autoscaling (KEDA) with Prometheus integration to scale their GPU workloads based on metrics exposed by the Triton Inference Server's `/metrics` endpoint. KEDA lets them scale out on specific metrics, including inference queue duration and queue count, which give a more accurate picture of performance issues. They also enable deallocation-on-scale-down mode in AKS to accelerate node scaling and reduce pending time, improving both cost efficiency and reliability.
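The RTF guideline mentioned in the summary is conventionally computed as processing time divided by audio duration — values below 1 mean the system keeps up with real time. A minimal sketch (the function name and values here are illustrative, not Speechmatics' actual code):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed.

    RTF < 1 means the pipeline keeps up with real time; lower values
    mean more headroom per GPU, and hence a better cost profile.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Example: 30 s of compute for a 120 s clip gives an RTF of 0.25.
print(real_time_factor(30.0, 120.0))  # -> 0.25
```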