Best practices for monitoring and remediating connection churn

Company

Datadog

Date Published

Sept. 18, 2024

Authors

Nicholas Thomson, Guy Arbitman

Word count

1691

Language

English

Hacker News points

None

URL

www.datadoghq.com/blog/monitor-connection-churn-datadog

Summary

Nicholas Thomson and Guy Arbitman explain why monitoring connection churn matters in distributed systems, where high churn can be a sign of an unhealthy system. Connection churn is the rate at which clients open and close TCP connections; when it is high, it can lead to performance degradation, increased latency, and resource issues. The authors highlight common symptoms of connection churn, such as elevated TCP socket latency, request bottlenecks, and decreased throughput. They also explain its causes, including a spike in users, misconfigured client services, and scaling up or down without proper configuration. To troubleshoot connection churn, they advise gathering monitoring data from all distributed services and tracking key metrics such as established and closed connections, latency, and error rates. The authors recommend Datadog's Network Performance Monitoring (NPM) and Universal Service Monitoring (USM) to monitor connection churn and pinpoint its root cause in a distributed system.
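
As a rough illustration of tracking established and closed connections, the sketch below derives a churn rate from two readings of cumulative connection counters. It is a minimal example and not Datadog's implementation: the `ConnSample` shape, the definition of churn as opens plus closes per second, and the threshold of 100 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConnSample:
    """One reading of cumulative TCP connection counters (hypothetical shape)."""
    timestamp: float   # seconds since an arbitrary start
    established: int   # total connections opened so far
    closed: int        # total connections closed so far


def churn_rate(prev: ConnSample, curr: ConnSample) -> float:
    """Return connections opened plus closed per second between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    opened = curr.established - prev.established
    closed = curr.closed - prev.closed
    return (opened + closed) / elapsed


def is_churning(rate: float, threshold: float = 100.0) -> bool:
    """Flag a rate above an illustrative threshold (assumed, tune per service)."""
    return rate > threshold
```

In practice the counters would come from whatever monitoring data each service already exposes, and the threshold would be set against that service's normal baseline rather than a fixed number.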