Company
Date Published
Author
Theo Julienne
Word count
4183
Language
English
Hacker News points
None

Summary

Over the past few years, GitHub has adopted Kubernetes as a standard deployment pattern for many internal and public-facing services, but it encountered sporadic latency problems that were not caused by the applications themselves. The latency spikes, sometimes exceeding 100 ms, were traced to delays in packet processing on certain Kubernetes nodes, affecting TCP and ICMP packets in particular. The investigation showed that the delays were linked to the Linux kernel's packet processing and were exacerbated by cAdvisor, a tool that monitors resource usage in containers, which was inadvertently causing the stalls. The stalls came from slow reads of the memory.stat file, attributed to "zombie" cgroups that retained cached memory after their processes had exited. Clearing the cache or rebooting affected nodes mitigated the problem, and upgrading to a newer Linux kernel version with improved memory.stat performance resolved it permanently. The case highlights the importance of maintaining foundational systems like Kubernetes so that the services built on them remain reliable and performant.
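
The slow memory.stat reads described above can be checked for on a node by timing a few reads directly. The following is a minimal sketch, not the method used in the investigation: the path assumes a cgroup v1 layout mounted at /sys/fs/cgroup/memory, the helper names time_reads and slow_reads are hypothetical, and the 100 ms threshold is only an illustrative cutoff borrowed from the latency figure in the summary.

```python
import time
from pathlib import Path

# Assumption: cgroup v1 memory controller mounted at this location.
# Adjust the path for your host; it is not taken from the article.
DEFAULT_STAT_PATH = Path("/sys/fs/cgroup/memory/memory.stat")


def time_reads(path, repeats=5):
    """Return the wall-clock duration, in seconds, of each full read of `path`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        Path(path).read_bytes()  # read the whole file, discard the contents
        durations.append(time.perf_counter() - start)
    return durations


def slow_reads(durations, threshold_s=0.1):
    """Return the durations strictly above `threshold_s` (illustrative 100 ms default)."""
    return [d for d in durations if d > threshold_s]


if __name__ == "__main__":
    if DEFAULT_STAT_PATH.exists():
        samples = time_reads(DEFAULT_STAT_PATH)
        print(f"max read time: {max(samples) * 1000:.1f} ms")
        print(f"reads over 100 ms: {len(slow_reads(samples))}")
    else:
        print(f"{DEFAULT_STAT_PATH} not found; pass another path to time_reads()")
```

On a healthy node each read would normally be expected to finish quickly, so read times that repeatedly land in the tens or hundreds of milliseconds are a hint worth following up, not proof of the zombie cgroup problem.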