It's always DNS . . . except when it's not: A deep dive through gRPC, Kubernetes, and AWS networking

Company

Datadog

Date Published

April 13, 2022

Author

Laurent Bernaille, David Lentz

Word count

3590

Language

English

Hacker News points

URL

www.datadoghq.com/blog/engineering/grpc-dns-and-load-balancing-incident

Summary

The investigation began when a routine update to one of Datadog's critical services caused an increase in errors. The logs initially pointed to DNS issues, but further analysis revealed that NodeLocal DNSCache was reaching its concurrency limit, causing OOM errors. Increasing the memory allocation for the pods did not resolve the issue, and it remained unclear why the cache was hitting its limit so frequently. Further investigation uncovered saturated VPC connection tracking (conntrack), which was preventing new network connections from being established and leading to the DNS errors. VPC Flow Logs showed that clients kept sending SYN packets to old IP addresses after the corresponding pods had been deleted, causing a high rate of dropped packets. The issue was eventually resolved by changing the gRPC load balancing policy from `pick_first` to `round_robin`, which caused clients to reconnect automatically and reduced the number of SYN packets sent to old IP addresses. The incident highlighted the importance of understanding edge cases within complex systems and the need for careful analysis and testing before making changes that can have unintended effects.
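As a rough illustration of the policy change described in the summary, the sketch below builds gRPC channel options that select a load balancing policy through a service config. It is not taken from the post: the helper name `build_channel_options` and the `TARGET` address are hypothetical, and the snippet only assumes the standard `grpc.service_config` channel argument and the `loadBalancingConfig` service config field.

```python
import json
from typing import List, Tuple

# Hypothetical target, for illustration only; the dns:/// scheme asks the
# gRPC client to resolve the name through DNS.
TARGET = "dns:///backend.example.svc.cluster.local:50051"


def build_channel_options(lb_policy: str = "round_robin") -> List[Tuple[str, str]]:
    """Return channel options that select a gRPC load balancing policy.

    Hypothetical helper: it only assembles the service config JSON and
    does not open a connection.
    """
    if lb_policy not in ("pick_first", "round_robin"):
        raise ValueError(f"unsupported policy: {lb_policy}")
    service_config = {"loadBalancingConfig": [{lb_policy: {}}]}
    return [("grpc.service_config", json.dumps(service_config))]


if __name__ == "__main__":
    options = build_channel_options("round_robin")
    print(options)
    # With the grpcio package installed, the options could be used as:
    #   import grpc
    #   channel = grpc.insecure_channel(TARGET, options=options)
```

Keeping the policy in a service config string makes the choice explicit at the call site, so switching between `pick_first` and `round_robin` is a one-argument change that can be tested before rollout.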