Company
Date Published
Author
Adam Walford
Word count
1277
Language
English
Hacker News points
None

Summary

Speechmatics has moved its transcription models onto GPUs, which significantly improves accuracy but also increases cost. To keep processing efficient and cost-effective, they use the Real Time Factor (RTF) — the ratio of processing time to audio duration — as a performance guideline. Running on GPU hardware introduces challenges such as shared resources and unpredictable traffic demand. To address this, Speechmatics uses Kubernetes Event-driven Autoscaling (KEDA) with Prometheus integration to scale their GPU workloads based on metrics exposed by the Triton Inference Server's `/metrics` endpoint. KEDA lets them scale out on specific metrics, including inference queue duration and queue count, which give a more accurate picture of performance issues. They also enable deallocation-on-scale-down mode in AKS to accelerate node scaling and reduce pending time, improving both cost efficiency and reliability.
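The RTF guideline mentioned in the summary is conventionally computed as processing time divided by audio duration — values below 1 mean the system keeps up with real time. A minimal sketch (the function name and values here are illustrative, not Speechmatics' actual code):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed.

    RTF < 1 means the pipeline keeps up with real time; lower values
    mean more headroom per GPU, and hence a better cost profile.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Example: 30 s of compute for a 120 s clip gives an RTF of 0.25.
print(real_time_factor(30.0, 120.0))  # -> 0.25
```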