
Using Braintrust to eval agentic setups from large-scale Hugging Face data

Blog post from Braintrust

Post Details
Company: Braintrust
Date Published:
Author: -
Word Count: 4,388
Company Posts That Month: 30
Language: English
Hacker News Points: -
Summary

Running an AI agent in production generates raw traces that do not yield straightforward answers; finding patterns in them requires careful querying and evaluation. An analysis of 1,781 agent traces from Exgentic, hosted on Hugging Face, found that the choice of harness affects performance roughly seven times more than the choice of model. Changing the harness alone can shift success rates from 12% to 92% without affecting token costs.

Open-weight models such as DeepSeek and Kimi proved production-ready, particularly for coding tasks, where they reach success rates comparable to closed models while also allowing self-hosting. Cost per task and cost per success vary widely: open-weight models are more cost-effective for coding tasks, while closed models such as GPT-4.1 are more cost-efficient for conversational tasks.

High average performance does not guarantee reliability, since a configuration can score well overall yet fail on specific tasks. No model is best everywhere; different models excel at different task types, so selecting a configuration means weighing both success rate and cost efficiency. Failure patterns also differ by task type: coding failures often involve excessive token usage, while conversational failures tend to result from abandoning the task prematurely. Braintrust enabled structured analysis and regression, turning raw traces into actionable insights and supporting targeted experiments against specific failure modes.
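The distinction between cost per task and cost per success can be made concrete with a small sketch. The Python below groups trace records by configuration (model plus harness) and computes success rate, cost per task, and cost per success. The record fields (`model`, `harness`, `success`, `cost_usd`) and the sample values are hypothetical, chosen only for illustration; they are not the Exgentic schema or results, and this is not the Braintrust API.

```python
from collections import defaultdict

# Hypothetical trace records: field names and values are illustrative only.
traces = [
    {"model": "model-a", "harness": "harness-x", "success": True, "cost_usd": 0.40},
    {"model": "model-a", "harness": "harness-x", "success": False, "cost_usd": 0.60},
    {"model": "model-a", "harness": "harness-y", "success": True, "cost_usd": 0.10},
    {"model": "model-a", "harness": "harness-y", "success": True, "cost_usd": 0.30},
]


def summarize(records, keys=("model", "harness")):
    """Aggregate success and cost metrics per configuration."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record)

    summary = {}
    for config, rows in groups.items():
        total_cost = sum(r["cost_usd"] for r in rows)
        successes = sum(1 for r in rows if r["success"])
        summary[config] = {
            "tasks": len(rows),
            "success_rate": successes / len(rows),
            # Spend divided by all attempts, failed or not.
            "cost_per_task": total_cost / len(rows),
            # Spend divided by successful attempts only; undefined with no successes.
            "cost_per_success": total_cost / successes if successes else None,
        }
    return summary


if __name__ == "__main__":
    for config, metrics in summarize(traces).items():
        print(config, metrics)
```

Because failed attempts still cost tokens, cost per success is never lower than cost per task, and the gap widens as the success rate drops. Passing `keys=("harness",)` or `keys=("model",)` gives the per-harness or per-model view, one rough way to compare how much each factor moves the numbers.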

Trends Found in this Post
Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM
LLM | 12 | 5,172 | 1,006 | 220 | -43%
OpenTelemetry | 2 | 701 | 153 | 53 | -26%
Real-time | 2 | 5,457 | 1,338 | 238 | -5%
AI Agents | 1 | 4,874 | 1,103 | 240 | -1%