Company:
Date Published:
Author: Ornella Altunyan
Word count: 747
Language: English
Hacker News points: None

Summary

Anthropic's recent release of Claude Sonnet 4.5 sets a new bar for AI performance in coding and reasoning, scoring 77.2% on SWE-bench Verified and sustaining autonomous operation for more than 30 hours. The central idea is "aspirational evals": tests written for capabilities that don't yet exist, used to identify new applications beyond what standard benchmarks measure. These evals define product features that are currently blocked by model limitations, and with each model release, Anthropic re-runs them to see which of those features have become buildable. The jump from Claude Sonnet 4 to 4.5 exemplifies the "capability cliff," where a model improves so sharply that it enables entirely new applications rather than incremental gains. Through Loop, Anthropic tests unsupervised prompt optimization, and Claude Sonnet 4.5 demonstrates significant performance improvements and faster inference times. This strategy of rapid evaluation and feature deployment lets Anthropic capitalize quickly on new model capabilities, an edge over competitors who follow traditional model assessment cycles.
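The "aspirational eval" workflow described above, maintaining a suite of tests for not-yet-possible capabilities and re-running it against each new model release, can be sketched as follows. This is a minimal illustration, not Anthropic's actual harness; the names (`AspirationalEval`, `run_evals`, `stub_model`) and the pass/fail criterion are all hypothetical.

```python
"""Sketch of an "aspirational eval" suite: capability tests written
before any model can pass them. All names here are hypothetical."""

from dataclasses import dataclass
from typing import Callable

# A "model" is abstracted as any prompt -> completion function,
# so the suite can wrap different API clients or model versions.
Model = Callable[[str], str]


@dataclass
class AspirationalEval:
    name: str
    # Returns True once a model release can actually perform the task.
    check: Callable[[Model], bool]


def run_evals(model: Model, suite: list[AspirationalEval]) -> dict[str, bool]:
    """Re-run the same suite against each new model release; an eval
    flipping from False to True signals a feature that just became
    buildable (the "capability cliff" moment)."""
    return {e.name: e.check(model) for e in suite}


# Stub standing in for a real model API call; it fails everything,
# which is the expected state for aspirational evals today.
def stub_model(prompt: str) -> str:
    return "unsupported"


suite = [
    AspirationalEval(
        name="long-horizon-autonomous-refactor",
        check=lambda m: m("refactor this repository end to end") != "unsupported",
    ),
    AspirationalEval(
        name="unsupervised-prompt-optimization",
        check=lambda m: m("improve this prompt without human review") != "unsupported",
    ),
]

results = run_evals(stub_model, suite)
print(results)
```

The point of the design is that the suite is cheap to re-run: when a new model ships, swapping in its client function immediately shows which previously blocked features are now worth building.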