
It works on my cluster: a tale of two troubleshooters

Blog post from Octopus Deploy

Post Details
Company: Octopus Deploy
Author: Liam Mackie
Word Count: 1,791
Language: English
Summary

Kubernetes can make simple issues look complex and complex ones look simple, which often sends the wrong team after the wrong diagnosis. This post walks through one such incident involving a GraphQL gateway application. It began with customer reports of timeouts and errors, and the infrastructure team initially investigated DNS. Thorough checks showed that DNS was working, yet the problem persisted until the software team found that recent changes, including a new local cache, were saturating the thread pool through file lock contention in the multi-replica production environment. The fix was to move the cache into memory and adjust the thread pool size. The incident highlights the value of involving developers early and making deployment history visible during troubleshooting. It also underscores the need to document dependencies, automate rollbacks, and collaborate across teams when running distributed systems like Kubernetes, and it is a reminder that the root cause is not always the most apparent one.
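
The failure mode described above can be illustrated with a small sketch. The summary does not say what language the gateway was written in or how its cache was locked, so everything here is an assumption made for illustration: a `threading.Lock` stands in for the exclusive file lock, `time.sleep` stands in for disk I/O performed while holding it, and the names `POOL_SIZE`, `REQUESTS`, `LOCK_HOLD_SECONDS`, `read_file_cache`, and `read_memory_cache` are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4             # hypothetical number of worker threads
REQUESTS = 8              # hypothetical burst of incoming requests
LOCK_HOLD_SECONDS = 0.05  # hypothetical time spent holding the cache lock

# Stand-in for an exclusive lock on a shared cache file (an assumption;
# the real incident involved a file lock, not an in-process lock).
file_lock = threading.Lock()


def read_file_cache(key):
    # Every lookup takes the exclusive lock, so lookups run one at a time
    # and the other workers sit blocked instead of serving requests.
    with file_lock:
        time.sleep(LOCK_HOLD_SECONDS)  # simulated disk I/O under the lock
        return f"value-for-{key}"


# In-memory alternative: a plain dict lookup with no shared lock to wait on.
memory_cache = {f"key-{i}": f"value-for-key-{i}" for i in range(REQUESTS)}


def read_memory_cache(key):
    return memory_cache[key]


def run_burst(lookup):
    """Serve REQUESTS lookups on a fixed-size pool and time the whole burst."""
    keys = [f"key-{i}" for i in range(REQUESTS)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        results = list(pool.map(lookup, keys))
    return time.perf_counter() - start, results


file_elapsed, file_results = run_burst(read_file_cache)
memory_elapsed, memory_results = run_burst(read_memory_cache)

print(f"lock-guarded cache: {file_elapsed:.3f}s")
print(f"in-memory cache:    {memory_elapsed:.3f}s")
```

In this sketch, adding workers does not help the lock-guarded version: the burst takes roughly `REQUESTS * LOCK_HOLD_SECONDS` regardless of pool size, because only one thread can hold the lock at a time. That is the sense in which lock contention can saturate a thread pool while unrelated layers, such as DNS, still check out as healthy.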