A one-line Kubernetes fix that saved 600 hours a year
Blog post from Cloudflare
Cloudflare engineers hit a significant bottleneck whenever they restarted Atlantis, a tool used to manage Terraform changes: each restart took about 30 minutes, blocking engineering work and triggering false alarms. The cause was Kubernetes' default behavior of recursively changing file ownership and permissions on a persistent volume when it is mounted. The volume had grown to millions of files, causing inode exhaustion and slow operations, and because fsGroupChangePolicy defaults to Always, Kubernetes re-applied permissions to every file on every restart. Setting the policy to OnRootMismatch cut restart times to 30 seconds and reclaimed nearly 50 hours of engineering time per month, or roughly 600 hours a year. The fix shows how default Kubernetes settings can become bottlenecks as data scales, and why securityContext settings are worth auditing for performance.
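
For readers who want to see where the setting lives, here is a minimal sketch of a Pod manifest with the policy changed. It is not Cloudflare's actual configuration: the pod name, image, UID/GID values, mount path, and claim name are all placeholders chosen for illustration. Only the fsGroupChangePolicy field and its OnRootMismatch value come from the post.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: atlantis-example            # hypothetical name
spec:
  securityContext:
    runAsUser: 1000                 # illustrative IDs, not from the post
    runAsGroup: 1000
    fsGroup: 1000                   # group applied to files on the volume
    # Default is "Always": recursively fix ownership/permissions on every mount.
    # "OnRootMismatch" skips the recursive walk when the volume's root
    # directory already has the expected ownership and permissions.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: atlantis
      image: example.com/atlantis:latest   # placeholder image reference
      volumeMounts:
        - name: data
          mountPath: /data                 # hypothetical mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: atlantis-data           # hypothetical claim name
```

The field sits in the pod-level securityContext, next to fsGroup, and only affects volume types that support fsGroup-based ownership management. In a Deployment or StatefulSet the same block would go under the pod template's spec.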