Day 2 with Cilium: Small configurations that keep large clusters boring
Blog post from Datadog
Datadog's experience operating Cilium across hundreds of Kubernetes clusters, thousands of nodes, and the pods running on them in multi-cloud environments shows that reliability at scale depends on careful configuration and monitoring. The core practices are adopting native routing to minimize overhead, standardizing the operator/agent split to manage cloud API interactions, and tuning IP Address Management (IPAM) to optimize resource allocation. Stable identity management, consistent Maximum Transmission Unit (MTU) settings, and rigorous upgrade validation are equally important for preventing disruptions. Datadog also monitors Cilium's control plane and datapath signals so that issues can be caught before they affect operations. The team uses tools such as bpftrace and bpftool to investigate and fix datapath anomalies, and it adopts the kube-proxy replacement so that service load balancing is handled in eBPF. Overall, Datadog's approach shows that standardized practices are what keep even the largest clusters running smoothly and predictably.
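To make the MTU point concrete, here is a minimal sketch of the kind of consistency check an operator might run. It is not taken from Datadog's tooling: the function names, the node names, and the sample MTU values are hypothetical, and the only overhead figure used is the standard 50 bytes that VXLAN adds on an IPv4 underlay. It shows why native routing leaves the full link MTU available to pods, and how a node whose MTU differs from the rest of the fleet can be flagged.

```python
from collections import Counter

# Bytes consumed by encapsulation on an IPv4 underlay.
# VXLAN: outer IPv4 (20) + UDP (8) + VXLAN header (8) + inner Ethernet (14) = 50.
# Native routing adds no encapsulation headers.
ENCAP_OVERHEAD = {"native": 0, "vxlan": 50}


def effective_pod_mtu(link_mtu: int, mode: str) -> int:
    """Return the MTU left for pod traffic after encapsulation overhead."""
    return link_mtu - ENCAP_OVERHEAD[mode]


def mtu_outliers(node_mtus: dict[str, int]) -> dict[str, int]:
    """Return nodes whose MTU differs from the most common value in the fleet.

    Assumption (hypothetical policy): the most common MTU is the intended one.
    """
    if not node_mtus:
        return {}
    expected, _ = Counter(node_mtus.values()).most_common(1)[0]
    return {node: mtu for node, mtu in node_mtus.items() if mtu != expected}


if __name__ == "__main__":
    # Hypothetical sample data; in practice this would come from node inventory.
    fleet = {"node-a": 9001, "node-b": 9001, "node-c": 1500}
    print(effective_pod_mtu(1500, "vxlan"))
    print(effective_pod_mtu(1500, "native"))
    print(mtu_outliers(fleet))
```

Treating the majority MTU as the expected value is a simplification; a real check would more likely compare each node against an explicitly configured target per cloud or network.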