
A one-line Kubernetes fix that saved 600 hours a year

Blog post from Cloudflare

Post Details
Company: Cloudflare
Date Published:
Author: Braxton Schafer
Word Count: 1,346
Language: English
Hacker News Points: -
Summary

Restarting Atlantis, a tool used to manage Terraform changes, had become a significant bottleneck for engineers: each restart took about 30 minutes, blocking engineering work and triggering false alarms. The cause was Kubernetes' default handling of the pod's persistent volume. The volume had grown to millions of files, which caused inode exhaustion and slow file operations, and on every restart Kubernetes recursively changed the ownership and permissions of each of those files. This behavior is controlled by fsGroupChangePolicy, which defaults to Always. Setting the policy to OnRootMismatch, so that the recursive change runs only when the volume's root directory does not already have the expected ownership and permissions, cut restart times to 30 seconds and reclaimed nearly 50 hours of engineering time per month, or roughly 600 hours a year. The fix shows how default Kubernetes settings can become bottlenecks as data scales, and why securityContext settings are worth auditing for performance.
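The difference between the two policies can be illustrated with a small read-only model. This is a sketch under stated assumptions, not the kubelet's actual code: the function name `paths_to_rewrite`, the `demo` helper, and the placeholder file names are hypothetical, and the root check is simplified to compare only the group ID, whereas Kubernetes also checks the root directory's permissions. In a pod spec, the field sits under `spec.securityContext` alongside `fsGroup`.

```python
import os
import tempfile


def paths_to_rewrite(root, fs_group, policy="Always"):
    """Return the paths whose ownership would be rewritten at mount time.

    Simplified model (assumption): only the root directory's group ID is
    compared; the real check also looks at the root's permission bits.
    Nothing is modified, the function only lists paths.
    """
    if policy not in ("Always", "OnRootMismatch"):
        raise ValueError(f"unknown policy: {policy}")

    if policy == "OnRootMismatch" and os.stat(root).st_gid == fs_group:
        # Root already carries the expected group: skip the recursive walk.
        return []

    # Otherwise every directory and file on the volume is visited.
    touched = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            touched.append(os.path.join(dirpath, name))
    return touched


def demo():
    """Build a tiny stand-in volume and count visited paths per policy."""
    with tempfile.TemporaryDirectory() as volume:
        for i in range(3):
            # Hypothetical placeholder files standing in for volume contents.
            with open(os.path.join(volume, f"plan-{i}.tfplan"), "w") as fh:
                fh.write("placeholder\n")
        gid = os.stat(volume).st_gid
        return (
            len(paths_to_rewrite(volume, gid, "Always")),
            len(paths_to_rewrite(volume, gid, "OnRootMismatch")),
            len(paths_to_rewrite(volume, gid + 1, "OnRootMismatch")),
        )


if __name__ == "__main__":
    print(demo())
```

With Always, the cost grows with the number of files on every restart. With OnRootMismatch, a volume whose root already matches costs a single stat call, which is why the change matters most on volumes holding millions of files.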