Breaking down OpenAI's outage: How to avoid a hidden DNS dependency in Kubernetes

Blog post from Render

Post Details
Company: Render
Author: David Mauskop
Word Count: 1,469
Language: English
Summary

OpenAI recently suffered a platform-wide outage after a newly deployed telemetry service overloaded its Kubernetes (K8s) control planes. The incident highlights why distributed systems must distinguish between the control plane and the data plane. Render's own experience shows that dependencies between the two planes, particularly DNS resolution, can make an incident far worse when the control plane becomes unavailable. An earlier Render incident involving an etcd memory spike showed that CoreDNS should run on data plane nodes rather than control plane nodes, so that DNS keeps resolving during a control plane outage. Render redesigned its control plane and isolated essential services such as CoreDNS and etcd, which reduced the risk of severe downtime and underscored the value of separating the control plane from the data plane. Telemetry services, which often run as DaemonSets on every node, can overwhelm K8s API servers with resource-intensive operations, as both OpenAI's and Render's incidents showed. The lessons for infrastructure reliability are to monitor proactively and to adjust configuration before a single overloaded component triggers cascading failures.
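
The placement concern described above can be illustrated with a small check. This is a minimal sketch, not Render's or OpenAI's tooling: it assumes control plane nodes carry the `node-role.kubernetes.io/control-plane` label and that CoreDNS pods carry the `k8s-app: kube-dns` label, and the node and pod inventory is hypothetical data shaped like a simplified cluster listing.

```python
# Label commonly applied to control plane nodes (assumed to be present here).
CONTROL_PLANE_LABEL = "node-role.kubernetes.io/control-plane"
# Label commonly carried by CoreDNS pods (an assumption for this sketch).
DNS_POD_LABEL = ("k8s-app", "kube-dns")


def dns_pods_on_control_plane(nodes, pods):
    """Return the sorted names of DNS pods scheduled on control plane nodes."""
    control_plane = {
        n["name"] for n in nodes if CONTROL_PLANE_LABEL in n.get("labels", {})
    }
    key, value = DNS_POD_LABEL
    return sorted(
        p["name"]
        for p in pods
        if p.get("labels", {}).get(key) == value and p.get("node") in control_plane
    )


# Hypothetical inventory; real data would come from the cluster's API.
nodes = [
    {"name": "cp-1", "labels": {CONTROL_PLANE_LABEL: ""}},
    {"name": "worker-1", "labels": {}},
]
pods = [
    {"name": "coredns-a", "node": "cp-1", "labels": {"k8s-app": "kube-dns"}},
    {"name": "coredns-b", "node": "worker-1", "labels": {"k8s-app": "kube-dns"}},
    {"name": "telemetry-x", "node": "cp-1", "labels": {"app": "telemetry"}},
]

# Any pod listed here would lose DNS service if its control plane node failed.
print(dns_pods_on_control_plane(nodes, pods))
```

An empty result would suggest that DNS resolution does not share nodes with the control plane, which is the separation the summary describes. The check says nothing about other hidden dependencies between the planes.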