Introducing WM Bench: A Benchmark for Cognitive Intelligence in World Models
Blog post from HuggingFace
WM Bench is a benchmark for evaluating the cognitive intelligence of world models: it tests whether a model truly understands its environment rather than merely rendering it convincingly. Existing benchmarks focus on visual and motion realism. WM Bench adds a cognitive dimension, scoring models on prediction-based reasoning, threat response, emotion escalation, contextual memory utilization, and adaptive recovery. The benchmark is organized into three pillars (Perception, Cognition, and Embodiment) that cover ten categories through 100 scenarios, scored on a 1000-point scale. Prometheus v1.0, a reference world model, serves as the evaluation baseline, highlighting both strengths and current limitations in cross-embodiment transfer. WM Bench is part of the FINAL Bench family. Its scoring rubrics are released openly and feedback is invited, with the aim of sparking discussion and improvement within the AI community. It is an early iteration and may have limitations in its complexity and scoring estimates.
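The scoring structure described above (100 scenarios on a 1000-point scale, grouped under three pillars) can be illustrated with a small aggregation sketch. This is not the benchmark's actual scoring code. It assumes every scenario carries equal weight (10 points each), which the post does not state, and the names `ScenarioResult`, `aggregate`, and the category label used below are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_TOTAL = 1000      # from the post: 1000-point scale
NUM_SCENARIOS = 100   # from the post: 100 scenarios
# ASSUMPTION: equal weighting, so each scenario is worth 1000 / 100 = 10 points.
POINTS_PER_SCENARIO = MAX_TOTAL / NUM_SCENARIOS


@dataclass(frozen=True)
class ScenarioResult:
    """One scored scenario (hypothetical structure, for illustration only)."""
    pillar: str      # "Perception", "Cognition", or "Embodiment"
    category: str    # placeholder label; the real ten category names are not given here
    fraction: float  # rubric score normalised to the range [0, 1]


def aggregate(results):
    """Return (total_points, points_per_pillar) for a list of scenario results."""
    per_pillar = defaultdict(float)
    for r in results:
        if not 0.0 <= r.fraction <= 1.0:
            raise ValueError(f"fraction out of range: {r.fraction}")
        per_pillar[r.pillar] += r.fraction * POINTS_PER_SCENARIO
    total = sum(per_pillar.values())
    return total, dict(per_pillar)


if __name__ == "__main__":
    # Two made-up results: one fully passed, one half passed.
    demo = [
        ScenarioResult("Cognition", "example-category", 1.0),
        ScenarioResult("Embodiment", "example-category", 0.5),
    ]
    total, by_pillar = aggregate(demo)
    print(total, by_pillar)
```

Reporting per-pillar subtotals alongside the total makes it easier to see where a model such as the Prometheus v1.0 baseline gains or loses points, for example on embodiment-related scenarios.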