Using Braintrust to eval agentic setups from large-scale Hugging Face data
Blog post from Braintrust
Running an AI agent in production generates raw traces that do not answer questions on their own; finding patterns in them takes careful querying and evaluation. An analysis of 1,781 agent traces from Exgentic, hosted on Hugging Face, found that the choice of harness matters roughly seven times more than the choice of model. Changing the harness alone can move success rates from 12% to 92% without affecting token costs. Open-weight models such as DeepSeek and Kimi proved production-ready, particularly for coding tasks, where they reach success rates comparable to closed models while also allowing self-hosting. Cost per task and cost per success vary widely: open-weight models are more cost-effective for coding tasks, while closed models such as GPT-4.1 are more cost-efficient for conversational tasks. A high average success rate does not guarantee reliability, since a configuration can perform well overall and still fail on specific tasks. No model is best everywhere, because different models excel at different task types, so configurations should be chosen on both success rate and cost efficiency. Failure patterns also differ by task type: coding failures often involve excessive token usage, while conversational failures tend to come from abandoning the task prematurely. Braintrust provided the structured analysis and regression tracking that turned the raw traces into actionable insights and supported targeted experiments against specific failure modes.
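The distinction between cost per task and cost per success is easy to blur, so a minimal sketch of the arithmetic may help. It assumes each trace has been flattened into a record with hypothetical `harness`, `model`, `success`, and `cost_usd` fields; those names and the toy values are illustrative only. They are not the Exgentic schema, not results from the analysis, and not the Braintrust API.

```python
from collections import defaultdict

# Hypothetical, hand-written records for illustration only.
# Field names and values are assumptions, not data from the Exgentic traces.
traces = [
    {"harness": "harness_a", "model": "model_x", "success": True, "cost_usd": 0.10},
    {"harness": "harness_a", "model": "model_x", "success": False, "cost_usd": 0.30},
    {"harness": "harness_b", "model": "model_x", "success": True, "cost_usd": 0.20},
    {"harness": "harness_b", "model": "model_x", "success": True, "cost_usd": 0.20},
]


def summarize(records):
    """Group records by (harness, model) and compute success and cost metrics."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["harness"], record["model"])].append(record)

    summary = {}
    for key, rows in groups.items():
        tasks = len(rows)
        successes = sum(1 for row in rows if row["success"])
        total_cost = sum(row["cost_usd"] for row in rows)
        summary[key] = {
            "tasks": tasks,
            "success_rate": successes / tasks,
            # Cost per task: total spend averaged over every attempt.
            "cost_per_task": total_cost / tasks,
            # Cost per success: total spend divided by successful attempts only.
            # Undefined (None) when a configuration never succeeds.
            "cost_per_success": total_cost / successes if successes else None,
        }
    return summary


summary = summarize(traces)
for (harness, model), metrics in sorted(summary.items()):
    print(harness, model, metrics)
```

In this toy data both configurations spend the same amount per task, yet the one with the lower success rate pays twice as much per success. That is the reason for reporting both metrics side by side when comparing harness and model combinations.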
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| LLM | 12 | 5,172 | 1,006 | 220 | -43% |
| OpenTelemetry | 2 | 701 | 153 | 53 | -26% |
| Real-time | 2 | 5,457 | 1,338 | 238 | -5% |
| AI Agents | 1 | 4,874 | 1,103 | 240 | -1% |