Breaking down OpenAI's outage: How to avoid a hidden DNS dependency in Kubernetes
Blog post from Render
OpenAI recently suffered a platform-wide outage after a newly deployed telemetry service overloaded its Kubernetes (K8s) control planes. The incident highlights why distributed systems need to keep the control plane and the data plane distinct. Render has had similar experiences: dependencies between the two planes, particularly around DNS resolution, can make an incident worse when the control plane is degraded. A past Render incident involving an etcd memory spike demonstrated the need to run CoreDNS on data plane nodes rather than control plane nodes, so that DNS resolution keeps working during a control plane outage. By redesigning its control plane and isolating essential services such as CoreDNS and etcd, Render reduced the risk of severe downtime.

Telemetry services often run as DaemonSets on every node, so any resource-intensive operation they perform against the K8s API servers is repeated once per node and can overwhelm those servers, as seen in both OpenAI's and Render's incidents. Both incidents point to the same lessons for infrastructure reliability: monitor proactively, and adjust configuration before a control plane failure can cascade into the data plane.
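
One way to surface the hidden dependency described above is to audit where DNS pods are actually scheduled. The following is a minimal sketch, not Render's or OpenAI's tooling. It assumes that control plane nodes carry the well-known `node-role.kubernetes.io/control-plane` label and that CoreDNS pods carry the `k8s-app=kube-dns` label, as they do by default in kubeadm-style clusters; your cluster may use different labels. The `Node` and `Pod` classes are hypothetical stand-ins for data you would fetch from the Kubernetes API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: the well-known role label marks control plane nodes.
CONTROL_PLANE_LABEL = "node-role.kubernetes.io/control-plane"
# Assumption: CoreDNS pods use the kubeadm default label k8s-app=kube-dns.
DNS_POD_LABEL = ("k8s-app", "kube-dns")


@dataclass
class Node:
    """Hypothetical, simplified view of a cluster node."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Pod:
    """Hypothetical, simplified view of a scheduled pod."""
    name: str
    node_name: str
    labels: Dict[str, str] = field(default_factory=dict)


def dns_pods_on_control_plane(pods: List[Pod], nodes: List[Node]) -> List[str]:
    """Return names of DNS pods scheduled on control plane nodes."""
    control_plane = {n.name for n in nodes if CONTROL_PLANE_LABEL in n.labels}
    key, value = DNS_POD_LABEL
    return [
        p.name
        for p in pods
        if p.labels.get(key) == value and p.node_name in control_plane
    ]


if __name__ == "__main__":
    nodes = [
        Node("cp-1", {CONTROL_PLANE_LABEL: ""}),
        Node("worker-1"),
    ]
    pods = [
        Pod("coredns-a", "cp-1", {"k8s-app": "kube-dns"}),
        Pod("coredns-b", "worker-1", {"k8s-app": "kube-dns"}),
        Pod("kube-apiserver", "cp-1", {"component": "kube-apiserver"}),
    ]
    # coredns-a is flagged: it would be lost if cp-1 went down.
    print(dns_pods_on_control_plane(pods, nodes))  # prints ['coredns-a']
```

An empty result suggests DNS resolution does not depend on control plane nodes staying healthy. It does not show that DNS is independent of the API server in other ways.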