AI evals are becoming the new compute bottleneck
Blog post from Hugging Face
AI evaluation is becoming a significant computational bottleneck: costs are escalating and now often surpass those of model training. The shift is most visible in advanced benchmarks and scientific machine learning tasks, where evaluation expenses can exceed training costs by orders of magnitude. The Holistic Agent Leaderboard (HAL) illustrates the scale of the problem, reporting costs of up to $40,000 for a single benchmark run.

Compressing evaluations has proven effective for static benchmarks, but agent and training-in-the-loop benchmarks resist such reductions, so reliable assessments remain expensive. The lack of standardized documentation compounds the problem: without a shared record of prior results, the same evaluations are run repeatedly, further driving up costs.

As a result, the divide is growing between institutions that can afford these evaluations and those that cannot, undermining the ability to independently validate AI systems. Sharing documentation and pooling resources could lower these costs and ease the economic barrier that evaluations now pose.
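The cost arithmetic can be made concrete with a back-of-envelope sketch. The function below is hypothetical (its name and parameters are not from the post); only the $40,000 per-run figure comes from the HAL example above. It shows how per-run cost multiplies across models, benchmarks, and the redundant reruns that missing shared documentation causes.

```python
# Back-of-envelope estimate of total evaluation spend.
# All structure here is illustrative; only the $40,000 per-run
# figure is taken from the HAL example in the post.

def eval_cost(models: int, benchmarks: int, cost_per_run: int, reruns: int = 1) -> int:
    """Total cost of evaluating every model on every benchmark.

    `reruns` models the duplicated evaluations that occur when
    results are not documented and shared between institutions.
    """
    return models * benchmarks * cost_per_run * reruns

# Five models on one expensive agent benchmark at $40,000 per run:
baseline = eval_cost(models=5, benchmarks=1, cost_per_run=40_000)
print(baseline)  # 200000

# If poor documentation forces each evaluation to be run twice,
# the total spend doubles:
duplicated = eval_cost(models=5, benchmarks=1, cost_per_run=40_000, reruns=2)
print(duplicated)  # 400000
```

The multiplicative structure is the point: shared documentation attacks the `reruns` factor directly, which is why the post argues it could meaningfully reduce total spend even without making any single run cheaper.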