The text outlines an experiment by ClickHouse that tested whether AI-powered observability, built on large language models (LLMs), can replace Site Reliability Engineers (SREs) for root cause analysis (RCA). Despite the promise of models such as Claude Sonnet 4, OpenAI's GPT models, and Gemini 2.5 Pro, the study concluded that they cannot yet autonomously identify root causes in complex, real-world scenarios without guidance, although they can assist with documentation tasks such as drafting RCA reports. The models could sometimes pinpoint issues, but their performance was inconsistent, largely for lack of context and domain specialization. Unpredictable token usage and cost further complicate integrating LLMs into automated observability workflows. The study suggests that the best current approach is collaborative: human engineers working with fast, scalable observability tools, with LLMs handling supportive tasks, which allows faster and more accurate incident resolution.
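The cost-unpredictability point can be made concrete with a small sketch. The per-token prices and token counts below are invented for illustration (they are not the study's figures or any vendor's real pricing); the point is only that when an agentic investigation's token usage varies run to run, per-incident cost varies with it, which makes budgeting an automated RCA pipeline difficult.

```python
# Hypothetical illustration: per-incident LLM cost is hard to budget because
# token usage varies widely between investigations. All numbers are invented.

PRICE_PER_1K_INPUT = 0.003   # assumed $/1K input tokens (not real pricing)
PRICE_PER_1K_OUTPUT = 0.015  # assumed $/1K output tokens (not real pricing)

def incident_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one RCA investigation given its token usage."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Three hypothetical investigations of the same class of incident: the model
# explores different tool-call paths each time, so token usage differs.
runs = [(40_000, 2_000), (150_000, 6_000), (500_000, 12_000)]
costs = [incident_cost(i, o) for i, o in runs]
print([round(c, 2) for c in costs])  # → [0.15, 0.54, 1.68]
```

Here the cheapest and costliest runs differ by roughly an order of magnitude for the same kind of incident, which is the planning problem the summary alludes to.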