It works on my cluster: a tale of two troubleshooters

Post Details

Company

Octopus Deploy

Date Published

Dec. 9, 2025

Author

Liam Mackie

Word Count

1,791

Language

English

Hacker News Points

-

Source URL

octopus.com/blog/verifying-and-troubleshooting-kubernetes-deployments

Summary

Kubernetes can make simple issues look complex and complex ones look simple, which often leads the wrong team to chase the wrong diagnosis, as happened in an incident involving a GraphQL gateway application. The incident began with customer reports of timeouts and errors, which prompted the infrastructure team to investigate DNS first. Thorough checks showed that DNS was working, yet the problem persisted until the software team discovered that recent changes, including a local cache implementation, were saturating the thread pool because of file lock contention in the multi-replica production environment. The fix was to move the cache into memory and adjust the thread pool size. The incident highlights the value of involving developers early and making deployment history visible during troubleshooting. It also underscores the need to document dependencies, automate rollbacks, and foster collaboration across teams that run distributed systems such as Kubernetes, and it is a reminder that the root cause is not always the most apparent one.
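The failure mode described in the summary, a bounded thread pool stalling behind a single cache lock, can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from the incident: a `threading.Lock` stands in for the file lock, no real file is touched, and the names `LockedFileCache`, `MemoryCache`, `run_burst`, and the constants `POOL_SIZE`, `REQUESTS`, and `LOCK_HOLD_SECONDS` are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# All values below are hypothetical and chosen only to make the effect visible.
POOL_SIZE = 4             # worker threads available to serve requests
REQUESTS = 20             # size of the simulated request burst
LOCK_HOLD_SECONDS = 0.02  # simulated I/O time spent while holding the lock


class LockedFileCache:
    """Stand-in for a file-backed cache guarded by one exclusive lock."""

    def __init__(self):
        # Assumption: a threading.Lock simulates the file lock.
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            # Every lookup does its slow work while holding the lock,
            # so the other workers sit blocked instead of serving requests.
            time.sleep(LOCK_HOLD_SECONDS)
            return f"value-for-{key}"


class MemoryCache:
    """In-memory cache: the lock is held only for a dictionary lookup."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.setdefault(key, f"value-for-{key}")


def run_burst(cache):
    """Send REQUESTS lookups through a bounded pool; return (seconds, results)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        results = list(pool.map(cache.get, range(REQUESTS)))
    return time.monotonic() - start, results


file_elapsed, file_results = run_burst(LockedFileCache())
memory_elapsed, memory_results = run_burst(MemoryCache())
print(f"locked file cache: {file_elapsed:.3f}s")
print(f"in-memory cache:   {memory_elapsed:.3f}s")
```

With the lock-guarded cache the pool is effectively reduced to one useful worker, so the burst takes roughly `REQUESTS * LOCK_HOLD_SECONDS` regardless of `POOL_SIZE`. That is why, in a sketch like this, enlarging the pool alone does not help and removing the contended lock does.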