
AI Evals vs. A/B Testing: Why You Need Both to Ship GenAI

Blog post from GrowthBook

Post Details
Company: GrowthBook
Date Published:
Author: Ryan Feigenbaum
Word Count: 2,331
Language: English
Hacker News Points: -
Summary

Generative AI shifts engineering from deterministic to probabilistic: the same input can produce different outputs, which breaks traditional software development assumptions and demands new evaluation methods. AI evaluations (evals) check whether a model is competent, while A/B testing measures whether it delivers value to users, so AI product development needs both. Vibe checking, in which engineers manually inspect outputs, is too subjective to scale for probabilistic systems, and the industry has moved toward systematic AI evals that assess AI applications quantitatively. But evals measure only capability, not user value, so A/B testing is still required to establish business impact such as retention and conversion. To deploy AI safely, the post recommends a staged pipeline of offline evals, shadow mode, feature flags, and full A/B testing, with each stage filtering out risk and confirming both the competence and the value of a model before it reaches all end users.
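The staged pipeline the post recommends can be sketched in code. This is a minimal, illustrative sketch only: the function names, the 0.85 eval threshold, and the placeholder model call are assumptions for illustration, not GrowthBook's actual API or the post's implementation. Stage 1 gates on an offline eval score (competence), stage 2 logs a shadow-mode candidate output without exposing it to users, and stages 3-4 use deterministic hash-based bucketing behind a feature flag so a fraction of users enters the A/B test (value).

```python
import hashlib

OFFLINE_EVAL_THRESHOLD = 0.85  # hypothetical pass bar for offline evals


def passes_offline_evals(scores: list[float]) -> bool:
    """Stage 1: gate on the mean offline eval score (competence check)."""
    return sum(scores) / len(scores) >= OFFLINE_EVAL_THRESHOLD


def shadow_mode(user_input: str, log: list) -> None:
    """Stage 2: run the candidate model silently alongside production and
    log its output for offline comparison; users never see it."""
    candidate_output = f"candidate({user_input})"  # placeholder model call
    log.append({"input": user_input, "candidate": candidate_output})


def ab_bucket(user_id: str, rollout_pct: int) -> str:
    """Stages 3-4: deterministic hash-based bucketing behind a feature
    flag, so the same user always lands in the same variant."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if h < rollout_pct else "control"


# A model that clears offline evals moves to shadow mode, then to a
# 10% feature-flagged A/B test that measures user value.
if passes_offline_evals([0.90, 0.88, 0.91]):
    shadow_log: list = []
    shadow_mode("summarize this ticket", shadow_log)
    variant = ab_bucket("user-123", rollout_pct=10)
```

Hash-based bucketing is the standard way feature-flag platforms keep assignment sticky: each user hashes to the same variant on every request, which is what makes the downstream A/B comparison of retention or conversion valid.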