Company
Braintrust
Date Published
Author
Braintrust Team
Word count
2347
Language
English
Hacker News points
None

Summary

Google's release of Gemini 3, a new family of AI models, brings advances in reasoning, tool use, and multimodal capabilities, but applying it to real-world agent workflows requires evaluation that goes beyond standard benchmarks. Adopting a new model starts with establishing a performance baseline for current models using production data, followed by systematic testing of Gemini 3 against real-world scenarios and metrics such as tool selection accuracy and response quality. Braintrust supports this evaluation by converting production traces into test datasets, enabling straightforward model comparisons and confident deployment decisions. Continuous monitoring in production confirms that improvements observed in testing hold up at scale, with feedback loops that fold performance data back into future evaluations and deployments. This approach lets AI teams respond quickly to new model releases, maintaining a cycle of evaluation, deployment, and monitoring that ensures models like Gemini 3 improve agent performance without introducing regressions.
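
The comparison workflow described above maps to a side-by-side eval. The sketch below is a minimal illustration using Braintrust's Python SDK; the project name, dataset records, model identifiers, and the call_model helper are hypothetical placeholders rather than code from the article.

```python
# Minimal sketch: score a baseline model and Gemini 3 on the same dataset.
# Assumes the `braintrust` and `autoevals` packages are installed and
# BRAINTRUST_API_KEY is set. All names below are illustrative placeholders.
from braintrust import Eval
from autoevals import Factuality


def call_model(model: str, prompt: str) -> str:
    """Hypothetical helper that sends a prompt to the given model and returns text."""
    ...


# A few test cases; in practice these would be converted from production traces.
dataset = [
    {
        "input": "Summarize the latest billing dispute for account 1042.",
        "expected": "Customer disputes a duplicate charge; a refund is pending.",
    },
]

# Run one experiment per model so results can be compared in the Braintrust UI.
for model in ["baseline-model", "gemini-3"]:  # hypothetical model identifiers
    Eval(
        "agent-model-comparison",              # hypothetical project name
        data=lambda: dataset,                  # test cases derived from traces
        task=lambda input, m=model: call_model(m, input),
        scores=[Factuality],                   # one example response-quality scorer
    )
```

Each run logs as its own experiment, so tool selection accuracy or other custom scorers could be added alongside Factuality to mirror the metrics mentioned in the summary.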