Using Braintrust to eval agentic setups from large-scale Hugging Face data

Post Details

Company

Braintrust

Date Published

June 25, 2026

Author

-

Word Count

4,388

Company Posts That Month

30

Language

English

Hacker News Points

-

Source URL

www.braintrust.dev/blog/hf-agent-traces

Summary

Running an AI agent in production generates raw traces that do not yield straightforward answers; extracting patterns requires careful querying and evaluation. An analysis of 1,781 agent traces from Exgentic, hosted on Hugging Face, found that the choice of harness affects performance approximately seven times more than the choice of model. Changing the harness can shift success rates from 12% to 92% without affecting token costs. Open-weight models such as DeepSeek and Kimi proved production-ready, particularly for coding tasks, with success rates comparable to closed models and the added advantage of self-hosting. Cost per task and cost per success vary greatly: open-weight models are more cost-effective for coding tasks, while closed models such as GPT-4.1 are more cost-efficient for conversational tasks. High average performance does not guarantee reliability, since a configuration can perform well overall yet fail on specific tasks. No model is universally best, because different models excel at different task types, so selecting a configuration means weighing both success rate and cost efficiency. Failure patterns also differ by task type: coding failures often involve excessive token usage, while conversational failures result from abandoning the task prematurely. Braintrust enabled structured analysis and regression, turning raw traces into actionable insights and supporting targeted experiments on specific failure modes.
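
The summary distinguishes cost per task from cost per success, and the two can diverge sharply when success rates are low. A minimal sketch of that arithmetic follows, assuming each trace has been reduced to a record with a harness, a model, a success flag, and a dollar cost. The field names, the `summarize` helper, and the sample values are hypothetical and are not taken from the Exgentic dataset or the Braintrust API.

```python
from collections import defaultdict

# Hypothetical trace records: field names and values are illustrative only,
# not drawn from the Exgentic dataset.
traces = [
    {"harness": "A", "model": "m1", "success": True, "cost_usd": 0.40},
    {"harness": "A", "model": "m1", "success": False, "cost_usd": 0.60},
    {"harness": "B", "model": "m1", "success": True, "cost_usd": 0.50},
    {"harness": "B", "model": "m1", "success": True, "cost_usd": 0.50},
]


def summarize(records):
    """Group records by (harness, model) and compute per-group metrics."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["harness"], record["model"])].append(record)

    summary = {}
    for key, rows in groups.items():
        n_tasks = len(rows)
        n_success = sum(1 for r in rows if r["success"])
        total_cost = sum(r["cost_usd"] for r in rows)
        summary[key] = {
            "success_rate": n_success / n_tasks,
            # Average spend per attempted task, successful or not.
            "cost_per_task": total_cost / n_tasks,
            # Total spend divided by successes; infinite if nothing succeeded.
            "cost_per_success": (
                total_cost / n_success if n_success else float("inf")
            ),
        }
    return summary


if __name__ == "__main__":
    for key, metrics in summarize(traces).items():
        print(key, metrics)
```

In this toy data both harnesses have the same cost per task, but harness A's failed run doubles its cost per success. This is why the two metrics are worth reporting side by side.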

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	12	5,172	1,006	220	-43%
OpenTelemetry	2	701	153	53	-26%
Real-time	2	5,457	1,338	238	-5%
AI Agents	1	4,874	1,103	240	-1%