Company
Date Published
Author
Lally Singh, Ashwin Venkatesan
Word count
2373
Language
English
Hacker News points
3

Summary

At Datadog, transitioning an existing job system to Kubernetes initially led to performance regression, with increased CPU time and slower job completion rates. The solution involved performance tuning, timing analysis, and optimizing Kubernetes configurations. Initial experiments showed underutilization and inefficiencies due to incorrect deployment setups and node configurations. By adjusting resource requests for pods and reducing the interval for worker status checks, the team reduced node count and improved efficiency. Key findings included that Kubernetes overhead was minimal in terms of memory and CPU, and further optimization could enhance performance. The transition ultimately enabled a more manageable and scalable system, leveraging Kubernetes' benefits across cloud providers.