Better Harness: A Recipe for Harness Hill-Climbing with Evals
Blog post from LangChain
Better-Harness is a system for improving AI agents by iteratively refining their harnesses, using evaluations (evals) as the learning signal in much the same way machine learning uses training data. The approach depends on high-quality evals, sourced from hand-curated examples, production traces, and external datasets, to steer agents toward desired behaviors and prevent overfitting. The system runs a cycle of data sourcing, experiment design, optimization, and review. Evals are categorized by behavioral tags so that experiments can target specific behaviors, and holdout sets check that improvements generalize. Human review and trace analysis help discover and address failure modes while guarding against regressions. In tests with models such as Claude Sonnet 4.6 and Z.ai’s GLM-5, the system improved agent behavior, which suggests the approach could autonomously refine agent harnesses and adapt to various domains.
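To make the cycle concrete, here is a minimal sketch of one way a tagged eval set, a holdout split, and a hill-climbing acceptance rule could fit together. It is not the Better-Harness implementation. Every name in it (Eval, split_holdout, score, hill_climb) is hypothetical, the harness is modeled as a plain set of rule strings, and each eval is reduced to a pass/fail function. A real setup would run the agent and grade its traces instead.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Assumption for illustration: a harness is just a set of rule strings.
Harness = FrozenSet[str]


@dataclass(frozen=True)
class Eval:
    """One eval case. All fields are hypothetical, not a real API."""
    name: str
    tag: str                           # behavioral tag, e.g. "tool_use"
    check: Callable[[Harness], bool]   # True if the harness passes this eval


def split_holdout(
    evals: Iterable[Eval], holdout_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[Eval], List[Eval]]:
    """Split evals per tag so each tag can appear in both sets."""
    rng = random.Random(seed)
    by_tag: dict = {}
    for e in evals:
        by_tag.setdefault(e.tag, []).append(e)

    optimize, holdout = [], []
    for tag in sorted(by_tag):
        group = list(by_tag[tag])
        rng.shuffle(group)
        k = int(len(group) * holdout_fraction)
        holdout.extend(group[:k])
        optimize.extend(group[k:])
    return optimize, holdout


def score(harness: Harness, evals: List[Eval]) -> float:
    """Fraction of evals the harness passes (0.0 for an empty list)."""
    if not evals:
        return 0.0
    return sum(e.check(harness) for e in evals) / len(evals)


def hill_climb(
    harness: Harness,
    proposals: Iterable[Callable[[Harness], Harness]],
    optimize: List[Eval],
    holdout: List[Eval],
) -> Tuple[Harness, float, float]:
    """Keep a proposed change only if it raises the optimization score
    and does not lower the holdout score."""
    best = harness
    best_opt, best_hold = score(best, optimize), score(best, holdout)
    for propose in proposals:
        trial = propose(best)
        opt, hold = score(trial, optimize), score(trial, holdout)
        if opt > best_opt and hold >= best_hold:
            best, best_opt, best_hold = trial, opt, hold
    return best, best_opt, best_hold
```

The acceptance rule is the design choice that matters here: a change that scores better on the optimization evals but worse on the holdout evals is rejected as a likely overfit or regression. The human review and trace analysis steps are not modeled in this sketch.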