Scary Things Happen in Production. Context Helps You Find Them.
Blog post from Honeycomb
In large-scale production systems, anomalies such as unusual request patterns and traffic spikes are commonplace, and without detailed analysis they look like noise. AI and machine learning can detect anomalies, but deciding which ones require action is harder: it often depends on understanding developer intent and the specific context of each system change. That means tracking changes in production environments, where every deployment is unique, and using high-cardinality data to trace the effects of those changes over time. The post argues that context increases the power of data analysis exponentially, because attributes can be combined with one another to reveal insights that were previously obscured. A case study of Homeaglow illustrates how detailed instrumentation and telemetry can surface hidden issues and inform product decisions, and shows why data collection and analysis should be treated as a product problem rather than merely an infrastructure one. Looking ahead, the post expects agents to require even more context and cardinality as the tech industry moves toward automated, factory-like software development processes.
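To make the point about combining attributes concrete, here is a minimal sketch in Python. It is not Honeycomb's implementation or API. The event fields (`build_id`, `region`, `platform`, `error`), their values, and the injected failure are all hypothetical, chosen only to show how an issue that looks mild in aggregate becomes obvious once attributes are combined. It also reads the "exponential" claim in one plausible way: with n attributes there are 2**n - 1 non-empty sets of attributes to group by.

```python
from collections import defaultdict
from itertools import combinations


def make_events():
    """Build synthetic wide events. All field names and values are hypothetical."""
    events = []
    for build in ("b101", "b102"):
        for region in ("us-east", "eu-west"):
            for platform in ("ios", "android"):
                for i in range(50):
                    # Assumed failure mode: only build b102 on android in eu-west errors.
                    failing = (build, region, platform) == ("b102", "eu-west", "android")
                    events.append({
                        "build_id": build,
                        "region": region,
                        "platform": platform,
                        "error": failing and i < 40,
                    })
    return events


def error_rates(events, keys):
    """Return the error rate for each combination of values of the given keys."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        group = tuple(event[k] for k in keys)
        totals[group] += 1
        errors[group] += event["error"]
    return {group: errors[group] / totals[group] for group in totals}


def worst_group(events, keys):
    """Return the (group, error rate) pair with the highest error rate."""
    rates = error_rates(events, keys)
    group = max(rates, key=rates.get)
    return group, rates[group]


events = make_events()
attributes = ("build_id", "region", "platform")

# The aggregate view: one number, with no hint of where the errors come from.
overall = sum(e["error"] for e in events) / len(events)
print(f"overall error rate: {overall:.0%}")

# Every non-empty subset of attributes is a possible group-by: 2**n - 1 views.
views = [c for r in range(1, len(attributes) + 1) for c in combinations(attributes, r)]
for keys in views:
    group, rate = worst_group(events, keys)
    print(f"group by {keys}: worst group {group} at {rate:.0%}")
```

In this synthetic data the overall error rate is 10%, grouping by `build_id` alone puts the worst build at 20%, and grouping by all three attributes isolates one combination failing 80% of the time. Three attributes give only seven views. Real events with dozens of attributes give far more, which is why the sketch enumerates views rather than hard-coding them.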