How we build evals for Deep Agents
Blog post from LangChain
Deep Agents is an open-source, model-agnostic agent harness. The team improves agent behavior by curating targeted evaluations ("evals"), each of which directly measures a behavior the agent should exhibit in production. Eval data comes from three sources: dogfooding, external benchmarks, and custom-written tests. Each eval is designed to reflect a real-world task and is self-documenting, with a detailed explanation of what it measures and tags that categorize it for grouping and analysis. The approach favors quality over quantity: an excessive number of evals can misrepresent what the agent is actually capable of in production. Evals run with pytest in GitHub Actions and measure correctness and efficiency, using metrics such as step ratio, tool call ratio, and solve rate to guide changes to the model harness and optimize agent performance. Concentrating on the behaviors that affect user experience and cost makes the agent more reliable, keeps resources from being spent on evals that do not matter, and makes maintaining and improving the evals a shared responsibility across the team.
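To make the metrics concrete, here is a minimal sketch of how per-eval results could be scored and grouped by tag. It is not the Deep Agents implementation. The `EvalResult` record, its field names, and the `summarize_by_tag` helper are hypothetical, and the metric definitions are assumptions: step ratio and tool call ratio are taken as observed counts divided by the counts in a hand-written ideal trajectory, and solve rate is taken as ideal steps per second of latency, scored as zero when the eval fails.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One eval run. Every field name here is hypothetical."""
    name: str
    tags: tuple[str, ...]   # categories used for grouping and analysis
    passed: bool            # correctness: did the agent produce the expected outcome?
    steps: int              # steps the agent actually took
    ideal_steps: int        # steps in a hand-written reference trajectory
    tool_calls: int         # tool calls the agent actually made
    ideal_tool_calls: int   # tool calls in the reference trajectory
    latency_s: float        # wall-clock time for the run, in seconds


def step_ratio(r: EvalResult) -> float:
    # Assumed definition: observed steps / ideal steps (1.0 matches the reference).
    return r.steps / r.ideal_steps


def tool_call_ratio(r: EvalResult) -> float:
    # Assumed definition: observed tool calls / ideal tool calls.
    return r.tool_calls / r.ideal_tool_calls


def solve_rate(r: EvalResult) -> float:
    # Assumed definition: ideal steps per second, zero if the eval failed.
    return r.ideal_steps / r.latency_s if r.passed else 0.0


def summarize_by_tag(results: list[EvalResult]) -> dict[str, dict[str, float]]:
    """Average correctness and efficiency metrics for each tag."""
    grouped: dict[str, list[EvalResult]] = defaultdict(list)
    for r in results:
        for tag in r.tags:
            grouped[tag].append(r)

    summary: dict[str, dict[str, float]] = {}
    for tag, rs in grouped.items():
        n = len(rs)
        summary[tag] = {
            "pass_rate": sum(r.passed for r in rs) / n,
            "mean_step_ratio": sum(step_ratio(r) for r in rs) / n,
            "mean_tool_call_ratio": sum(tool_call_ratio(r) for r in rs) / n,
            "mean_solve_rate": sum(solve_rate(r) for r in rs) / n,
        }
    return summary


if __name__ == "__main__":
    # Illustrative values only, not real eval results.
    demo = [
        EvalResult("edit_file", ("file_ops",), True, 6, 4, 3, 2, 2.0),
        EvalResult("plan_then_edit", ("file_ops", "planning"), False, 4, 4, 2, 2, 4.0),
    ]
    for tag, metrics in summarize_by_tag(demo).items():
        print(tag, metrics)
```

Under these assumed definitions, a ratio above 1.0 means the agent took more steps or tool calls than the reference trajectory, so a passing eval can still flag an efficiency regression. In a pytest suite, the same tags could plausibly be expressed as pytest markers (`@pytest.mark.<tag>`) so that a group can be selected with `pytest -m <tag>`. Whether Deep Agents does this is not stated here.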