Ameya Bhatawdekar on building AI evaluations at Braintrust
Blog post from WorkOS
At HumanX 2026 in San Francisco, Michael Grinich of WorkOS and Ameya Bhatawdekar of Braintrust discussed how to evaluate AI products for effectiveness and reliability. Building an AI feature is relatively straightforward; the hard part is validating it across numerous edge cases and real-world conditions. Bhatawdekar emphasized that traditional software testing is insufficient for AI systems because their outputs are probabilistic, so teams need specialized evaluation frameworks to confirm that a system actually improves over time. Braintrust addresses this with tools that let teams define evaluation criteria, run experiments against datasets, and monitor changes in output quality, and it advocates evaluating continuously alongside development. The conversation also stressed that prompt engineering deserves the same rigor as any other engineering work, including version control and systematic testing. That discipline, backed by evaluation infrastructure, helps bridge the gap between a working demo and a reliable production system. Bhatawdekar argued that evaluation tooling should become as essential to AI development as CI/CD is to software delivery, and encouraged teams to invest in evaluation pipelines to prevent production regressions.
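To make the idea of an evaluation pipeline concrete, here is a minimal sketch in plain Python. It is not Braintrust's SDK or API, and nothing in it comes from the talk: the names `Case`, `run_eval`, `check_regression`, the substring scorer, and the two stand-in "prompt versions" are all hypothetical, chosen only to illustrate the loop of defining a criterion, scoring a dataset, and comparing a candidate against a baseline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One evaluation example: an input and the text we expect to see."""
    prompt: str
    expected: str


def contains_expected(output: str, expected: str) -> float:
    """Hypothetical criterion: 1.0 if the expected text appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: Sequence[Case],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the task on every case and return the mean score."""
    if not dataset:
        raise ValueError("dataset must contain at least one case")
    scores = [scorer(task(case.prompt), case.expected) for case in dataset]
    return sum(scores) / len(scores)


def check_regression(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Return True if the candidate score has not dropped below the baseline."""
    return candidate >= baseline - tolerance


# Stand-ins for two versions of a prompt. A real task would call a model;
# canned answers keep this sketch deterministic and runnable offline.
def prompt_v1(prompt: str) -> str:
    return {"capital of France?": "Paris", "2 + 2?": "5"}.get(prompt, "")


def prompt_v2(prompt: str) -> str:
    return {"capital of France?": "The capital is Paris.", "2 + 2?": "4"}.get(prompt, "")


DATASET = [Case("capital of France?", "Paris"), Case("2 + 2?", "4")]

if __name__ == "__main__":
    baseline = run_eval(prompt_v1, DATASET, contains_expected)
    candidate = run_eval(prompt_v2, DATASET, contains_expected)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
    # In CI, a non-zero exit code would fail the build on a regression.
    if not check_regression(candidate, baseline):
        raise SystemExit(1)
```

Because real model outputs vary between runs, a production setup would likely average over repeated samples or use a tolerance rather than an exact comparison; the `tolerance` parameter is one simple way to allow for that.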