The company Speechmatics has moved its transcription models onto GPUs, which significantly improves accuracy but also raises costs. To keep processing efficient and cost-effective, they use the Real Time Factor (RTF) — the ratio of processing time to audio duration — as a performance guideline. Running on GPU hardware, however, introduces challenges such as shared resources and unpredictable traffic demand. To address this, Speechmatics uses Kubernetes Event-Driven Autoscaling (KEDA) with Prometheus integration to scale their GPU deployments based on metrics exposed by the Triton Inference Server's `/metrics` endpoint. KEDA lets them scale out on specific metrics, including inference queue duration and queue count, which give a more accurate picture of performance issues. Additionally, they enable the deallocate scale-down mode in AKS, which deallocates nodes rather than deleting them; this accelerates subsequent scale-ups and reduces pending time, improving cost efficiency and reliability.
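A setup along these lines could be expressed as a KEDA `ScaledObject` with a Prometheus trigger. The sketch below is illustrative, not Speechmatics' actual configuration: the deployment name, Prometheus address, and threshold are assumptions, and the PromQL query derives an average per-request queue time from Triton's cumulative `nv_inference_queue_duration_us` and `nv_inference_request_success` counters.

```yaml
# Hypothetical KEDA ScaledObject scaling a Triton deployment on
# average inference queue time, computed from Triton /metrics
# counters scraped by Prometheus.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-queue-scaler
spec:
  scaleTargetRef:
    name: triton-inference-server   # assumed Deployment name
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # assumed address
        # Average queue time per request over the last minute:
        # cumulative queued microseconds divided by completed requests.
        query: |
          sum(rate(nv_inference_queue_duration_us[1m]))
          / sum(rate(nv_inference_request_success[1m]))
        threshold: "50000"   # illustrative: scale out past ~50 ms avg queue time
```

Scaling on queue time rather than CPU or GPU utilization means replicas are added when requests actually start waiting, which maps more directly onto the RTF target than raw utilization would.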