Why We Built VIBE Bench: Rethinking Evaluation for Real Workloads
Blog post from Hugging Face
MiniMax introduces VIBE Bench, a benchmark that evaluates a model's full-stack ability to build complete, runnable applications, with an emphasis on real user experience and practical deployment value. Unlike traditional benchmarks that focus on static code correctness, VIBE runs each application in a real execution environment and evaluates its interaction logic and visual presentation, which gives a fuller picture of usability. The benchmark is organized into subsets by technology stack (Web, Simulation, Android, iOS, and Backend) and covers domains such as native Android and iOS development and high-fidelity scientific simulations. Central to VIBE is the Agent-as-a-Verifier (AaaV) paradigm, in which a vision-enabled agent acts as an automated QA tester, interacting with each application inside a sandboxed environment to evaluate its behavior and visual output. VIBE Bench comprises three evaluation layers: Execution, Interaction, and Visual & Aesthetics. Together they address application viability, usability, and presentation, bridging the gap from code correctness to deliverable user experiences.
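To make the three-layer structure concrete, here is a minimal Python sketch of how per-layer verdicts from a verifier agent might be collected into one report. It is not VIBE Bench's actual implementation: the names `Layer`, `LayerResult`, and `summarize`, the 0.0 to 1.0 score range, and the choice to report an unevaluated layer as 0.0 are all assumptions made for illustration. Only the three layer names come from the post.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three evaluation layers named in the post."""
    EXECUTION = "Execution"
    INTERACTION = "Interaction"
    VISUAL = "Visual & Aesthetics"


@dataclass(frozen=True)
class LayerResult:
    """One verdict from a (hypothetical) verifier agent for a single layer."""
    layer: Layer
    score: float  # assumed range: 0.0 (fail) to 1.0 (pass); not from the post
    notes: str = ""


def summarize(results: list[LayerResult]) -> dict[str, float]:
    """Collect per-layer scores into a report keyed by layer name.

    Assumption: a layer with no result is reported as 0.0.
    """
    report = {layer.value: 0.0 for layer in Layer}
    for result in results:
        if not 0.0 <= result.score <= 1.0:
            raise ValueError(f"score out of range for {result.layer.value}")
        report[result.layer.value] = result.score
    return report


if __name__ == "__main__":
    example = [
        LayerResult(Layer.EXECUTION, 1.0, "app built and launched"),
        LayerResult(Layer.INTERACTION, 0.5, "one of two flows completed"),
    ]
    for name, score in summarize(example).items():
        print(f"{name}: {score}")
```

Keeping the layers as separate entries rather than collapsing them into a single number mirrors the post's framing that each layer addresses a different aspect of the application. How VIBE Bench actually aggregates or weights the layers is not stated here.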