
The Benchmarks Are Lying to You: Why You Should A/B Test Your AI

Blog post from GrowthBook

Post Details
Word Count: 1,337
Language: English
Summary

Traditional benchmarks often fail to reflect how AI models perform in real-world applications. The post's example is the gap between GPT-5's high scores on coding benchmarks and developers' actual preference for Anthropic's models in practical use. Benchmarks measure performance on standardized tasks, so they do not capture the needs and constraints of a specific production environment, such as cost, latency, and task-specific performance. The post instead advocates rigorous A/B testing with real users and workloads as a more reliable way to select and optimize large language models (LLMs), because it lets businesses measure the metrics that actually drive value, such as task completion rates and cost efficiency. It also highlights a "portfolio approach," which uses a variety of models, each tailored to different tasks, as an effective way to optimize performance and cost. Ultimately, the true measure of an AI model's success is its ability to solve user problems within operational constraints, not its scores on standardized benchmarks.
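
To make the A/B comparison concrete, here is a minimal sketch of how two model variants might be compared on task completion rate and cost per completed task. It is an illustration under stated assumptions, not the post's or GrowthBook's method: the `VariantStats` class, its field names, and the sample numbers are all hypothetical, and the significance check is a plain two-proportion z-test chosen for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Aggregated results for one model variant (hypothetical structure)."""
    name: str
    users: int             # users exposed to this variant
    completions: int       # users who completed the task
    total_cost_usd: float  # total inference spend for this variant

    @property
    def completion_rate(self) -> float:
        return self.completions / self.users

    @property
    def cost_per_completion(self) -> float:
        # No completions means the cost per completed task is unbounded.
        if self.completions == 0:
            return math.inf
        return self.total_cost_usd / self.completions


def two_proportion_z_test(a: VariantStats, b: VariantStats) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the completion-rate difference b - a."""
    pooled = (a.completions + b.completions) / (a.users + b.users)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.users + 1 / b.users))
    z = (b.completion_rate - a.completion_rate) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers, for illustration only.
control = VariantStats("model_a", users=1000, completions=600, total_cost_usd=50.0)
treatment = VariantStats("model_b", users=1000, completions=650, total_cost_usd=30.0)

z, p = two_proportion_z_test(control, treatment)
for v in (control, treatment):
    print(f"{v.name}: completion={v.completion_rate:.1%}, "
          f"cost/completion=${v.cost_per_completion:.4f}")
print(f"z={z:.2f}, p={p:.4f}")
```

Reporting cost per completed task alongside the completion rate reflects the summary's point that the winning model is the one that solves user problems within operational constraints, which a benchmark score alone cannot show.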