Monitoring 101: Investigating performance issues

Post Details

Company

Datadog

Date Published

July 16, 2015

Author

Alexis Lê-Quôc

Word Count

1,066

Company Posts That Month

19

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.datadoghq.com/blog/monitoring-101-investigation

Summary

In this post, Alexis Lê-Quôc argues that effective monitoring must go beyond detecting symptoms and support diagnosing root causes. Drawing on experience monitoring large-scale infrastructure and on insights from experts such as Brendan Gregg and Baron Schwartz, the article lays out a methodical process for investigating performance issues. The process relies on three main types of monitoring data (work metrics, resource metrics, and events) to build a thorough picture of system health. The post recommends starting with the work metrics of the highest-level system to establish whether there is a problem, then examining resource metrics if necessary, and finally checking for correlated events that might have caused the issue. It also stresses building dashboards in advance so that the relevant data is quickly accessible during an outage, and closes by inviting reader feedback on the framework.
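
The investigation order described in the summary can be sketched as a small recursive check. This is only an illustrative sketch, not code from the original post: the `System` and `Resource` classes, the boolean health flags, and the `investigate` function are all hypothetical names, and real monitoring data would be time series rather than booleans.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    name: str
    healthy: bool
    # Hypothetical link to the lower-level system that provides this resource.
    provider: Optional["System"] = None


@dataclass
class System:
    name: str
    work_metrics_ok: bool
    resources: List[Resource] = field(default_factory=list)
    recent_events: List[str] = field(default_factory=list)


def investigate(system: System) -> List[str]:
    """Walk the summary's three steps, starting from the given system."""
    findings: List[str] = []

    # Step 1: start with work metrics; if they look fine, stop here.
    if system.work_metrics_ok:
        return findings
    findings.append(f"{system.name}: work metrics degraded")

    # Step 2: examine resource metrics, descending into any system
    # that provides an unhealthy resource.
    for resource in system.resources:
        if not resource.healthy:
            findings.append(f"{system.name}: resource '{resource.name}' unhealthy")
            if resource.provider is not None:
                findings.extend(investigate(resource.provider))

    # Step 3: list events that may correlate with the problem.
    for event in system.recent_events:
        findings.append(f"{system.name}: correlated event: {event}")

    return findings


# Made-up example: a web tier that depends on a database.
database = System(
    "database",
    work_metrics_ok=False,
    resources=[Resource("disk_io", healthy=False)],
    recent_events=["schema migration deployed"],
)
web = System(
    "web",
    work_metrics_ok=False,
    resources=[
        Resource("cpu", healthy=True),
        Resource("database", healthy=False, provider=database),
    ],
)
report = investigate(web)
```

Here `report` traces the path from the degraded web tier down to the database's disk I/O and the event recorded there. A healthy top-level system yields an empty list, since the sketch stops as soon as work metrics look normal.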
