Company
Date Published
Author
Ross McFarland
Word count
915
Language
English
Hacker News points
None

Summary

Visibility is crucial for managing complex networks, and as GitHub's network expanded, they enhanced their data collection processes, particularly through metric tagging. This allows the creation of dashboards that can drill down into specific issues by filtering multiple dimensions. An example provided details a rolling update on spine switches, showcasing how traffic flow was monitored to ensure no disruption. Metrics collected include interface statistics and device health, gathered via SNMP, which is complex but essential for effective monitoring. Data is collected by regional nodes cycling every 10 seconds, ensuring continuity even if a host fails. While their system is not open-source due to specific hardware dependencies, GitHub continues to evolve its monitoring capabilities and is seeking to expand its Site Reliability Engineering team.