How we build evals for Deep Agents

Post Details

Company

LangChain

Date Published

March 26, 2026

Author

-

Word Count

1,910

Company Posts That Month

25

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.langchain.com/blog/how-we-build-evals-for-deep-agents

Summary

Deep Agents is an open-source, model-agnostic agent harness. To improve agent behavior, the team curates targeted evaluations ("evals") that directly measure the behaviors they want from agents in production. Eval data comes from dogfooding, external benchmarks, and custom-written tests; each eval is designed to reflect a real-world task and is self-documenting, with a detailed explanation and category tags that make grouping and analysis efficient. The approach favors quality over quantity and cautions that a large number of evals may not accurately represent agent capabilities in production. Evals run with pytest on GitHub Actions and track correctness and efficiency metrics such as solve rate, step ratio, and tool call ratio, which help refine model harnesses and optimize agent performance. Concentrating on the aspects that affect user experience and cost-effectiveness improves agent reliability and conserves resources, and it fosters shared responsibility across the team for maintaining and improving evals.
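The summary names solve rate, step ratio, and tool call ratio but does not define them. As a rough sketch only, the Python below assumes that solve rate is the fraction of evals that pass, and that each ratio compares what the agent actually used against a hand-written ideal trajectory (so 1.0 means "as efficient as the reference"). The `EvalResult` type, its fields, and the function names are all hypothetical and are not taken from the Deep Agents codebase.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalResult:
    """One eval run. Every field here is a hypothetical illustration."""
    name: str
    passed: bool
    steps: int             # agent steps actually taken
    ideal_steps: int       # steps in an assumed reference trajectory
    tool_calls: int        # tool calls actually made
    ideal_tool_calls: int  # tool calls in the assumed reference trajectory


def _require_results(results: Sequence[EvalResult]) -> None:
    if not results:
        raise ValueError("need at least one eval result")


def solve_rate(results: Sequence[EvalResult]) -> float:
    """Assumed definition: fraction of evals that passed."""
    _require_results(results)
    return sum(r.passed for r in results) / len(results)


def step_ratio(results: Sequence[EvalResult]) -> float:
    """Assumed definition: total observed steps / total ideal steps."""
    _require_results(results)
    return sum(r.steps for r in results) / sum(r.ideal_steps for r in results)


def tool_call_ratio(results: Sequence[EvalResult]) -> float:
    """Assumed definition: total observed tool calls / total ideal tool calls."""
    _require_results(results)
    return (sum(r.tool_calls for r in results)
            / sum(r.ideal_tool_calls for r in results))


# Made-up sample data, only to show how the functions are called.
SAMPLE_RESULTS = [
    EvalResult("read_file", True, steps=4, ideal_steps=4,
               tool_calls=3, ideal_tool_calls=3),
    EvalResult("edit_file", False, steps=9, ideal_steps=6,
               tool_calls=8, ideal_tool_calls=5),
]
```

Under these assumed definitions, a ratio above 1.0 would indicate the agent spent more steps or tool calls than the reference trajectory, which is one way to surface efficiency regressions alongside correctness.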

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Harness engineering	5	154	104	59	+22%
LLM	2	6,078	960	218	+18%
Real-time	1	6,457	1,307	242	+28%
