
It's always DNS . . . except when it's not: A deep dive through gRPC, Kubernetes, and AWS networking

What's this blog post about?

The post recounts a series of network issues that occurred whenever updates were made to a critical service. The team first suspected DNS errors, but further investigation revealed more complex problems involving dropped packets, connection tracking, and gRPC client reconnect algorithms. Through extensive analysis and testing, the team traced the root cause to an aggressive gRPC reconnect parameter that produced a SYN flood during rollouts. Addressing it resolved the incident and gave the team valuable insight into their network infrastructure.
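To illustrate why an aggressive reconnect setting can matter, here is a minimal sketch of exponential backoff, not the post's actual configuration. The function name `attempts_within`, the parameter names, and every value passed to it are hypothetical; jitter is ignored, and each attempt is assumed to fail immediately. It counts how many connection attempts (each one a TCP SYN) a single client would make in a fixed window.

```python
def attempts_within(window_s, initial_backoff_s, multiplier=2.0, max_backoff_s=120.0):
    """Count connection attempts one client makes within window_s seconds.

    Assumptions (illustrative only): every attempt fails instantly,
    no jitter is applied, and the delay grows by `multiplier` after
    each failure until it reaches `max_backoff_s`.
    """
    elapsed = 0.0
    delay = initial_backoff_s
    attempts = 0
    while elapsed <= window_s:
        attempts += 1                 # one attempt, i.e. one SYN sent
        elapsed += delay              # wait before the next attempt
        delay = min(delay * multiplier, max_backoff_s)
    return attempts


if __name__ == "__main__":
    # Hypothetical settings: a conservative and an aggressive client.
    conservative = attempts_within(10, initial_backoff_s=1.0)
    aggressive = attempts_within(10, initial_backoff_s=0.125)
    print(conservative, aggressive)
```

Under these assumptions, a shorter initial or maximum backoff means more attempts per client in the same window. Multiplied across many clients reconnecting at once during a rollout, that difference is the kind of effect that can turn into a burst of SYN packets.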

Company
Datadog

Date published
April 13, 2022

Author(s)
Laurent Bernaille, David Lentz

Word count
3700

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.