You don’t know what your agent will do until it’s in production
Blog post from LangChain
Agent-based software presents unique challenges for production monitoring because it relies on natural-language inputs and large language models (LLMs), which behave non-deterministically and are sensitive to prompt wording. Traditional software has a finite input space and predictable code paths; agents instead face an effectively unbounded variety of user queries and perform complex multi-step reasoning, so traditional observability tools are insufficient.

Effective monitoring requires capturing complete prompt-response pairs, understanding multi-turn context, and analyzing an agent's decision-making trajectory. Human judgment remains essential for evaluating natural-language interactions, but manual review is resource-intensive. Teams therefore adopt structured annotation queues and use LLMs as proxies for human judgment, reserving human reviewers for the cases that need them most.

Tools like LangSmith provide specialized capabilities for agent observability: discovering usage patterns, evaluating quality continuously, and integrating monitoring into development workflows. Together, these approaches let cross-functional teams shift their focus from system metrics to what agents actually receive and produce.
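The combination of an LLM-as-judge with a human annotation queue can be sketched in a few lines. This is a minimal illustration, not LangSmith's API: the `Interaction`, `judge_with_llm`, and `triage` names are hypothetical, and the grader model is stubbed out as any callable that maps a prompt string to a reply.

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    """A captured prompt-response pair, the basic unit of agent observability."""
    prompt: str
    response: str
    scores: dict = field(default_factory=dict)

def judge_with_llm(interaction, llm):
    """Score one interaction by asking a grader model (LLM-as-judge)."""
    grading_prompt = (
        "Rate the following answer for correctness on a 1-5 scale.\n"
        f"Question: {interaction.prompt}\n"
        f"Answer: {interaction.response}\n"
        "Reply with a single digit."
    )
    # `llm` is any callable returning the grader's raw text reply.
    interaction.scores["correctness"] = int(llm(grading_prompt).strip())
    return interaction

def triage(interactions, llm, threshold=3):
    """Auto-grade every interaction; route low scorers to a human review queue."""
    needs_review = []
    for item in interactions:
        judge_with_llm(item, llm)
        if item.scores["correctness"] < threshold:
            needs_review.append(item)
    return needs_review
```

The design point is the division of labor: the LLM judge scores everything cheaply, and only interactions below the quality threshold consume scarce human attention in the annotation queue.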