Olly for SREs: 3 ways I actually use it in production
Blog post from Coralogix
In this practical breakdown of working with an autonomous AI agent, the author describes how Olly assists in investigating production incidents: it evaluates logs, metrics, traces, and alert context, produces a structured summary of the issue, and guides the user to the root cause within minutes. The investigation begins by deciding whether an alert reflects a genuine issue or a transient anomaly; Olly helps by establishing when behavior deviated from normal and by correlating error messages with metric spikes. Once it is clear what changed, the tool assesses whether the service in question is the origin of the degradation or is merely absorbing impact from elsewhere, which informs the escalation decision. Olly also supports structured hypothesis testing: it analyzes the evidence tied to each hypothesis, moves from metrics to logs to code, and identifies the root cause along with suggested fixes. This approach compresses the investigation steps and saves significant time during incident management in production environments.
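To make the first step concrete, here is a minimal sketch of telling a sustained deviation from a transient blip by comparing recent samples against a baseline window. It is not Olly's implementation, which the post does not describe at this level; the function name `classify_deviation`, the z-score threshold, and the `min_sustained` run length are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def classify_deviation(baseline, recent, z_threshold=3.0, min_sustained=3):
    """Label recent samples as 'normal', 'transient', or 'sustained'.

    Hypothetical helper: thresholds are illustrative, not tuned values.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    # Guard against a perfectly flat baseline (sigma == 0).
    limit = mu + z_threshold * max(sigma, 1e-9)

    # Mark each recent sample that exceeds the baseline-derived limit.
    flags = [x > limit for x in recent]
    if not any(flags):
        return "normal"

    # Find the longest run of consecutive out-of-range samples.
    longest = run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0
        longest = max(longest, run)

    return "sustained" if longest >= min_sustained else "transient"


# Example: error rate (%) per minute for a service.
baseline = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
print(classify_deviation(baseline, [1.1, 5.0, 1.0, 0.9]))  # single spike
print(classify_deviation(baseline, [4.8, 5.1, 5.3, 4.9]))  # persistent rise
```

A single out-of-range sample is treated as a blip, while several consecutive ones suggest a genuine issue worth investigating further. A real setup would likely also account for seasonality and for correlated signals such as error messages, as the post describes.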