The Backbone Breaker Benchmark (b3), developed by Lakera in collaboration with the UK AI Security Institute, is a novel approach to evaluating the security of AI agents by focusing on the vulnerabilities within their core large language models (LLMs). Unlike traditional benchmarks that assess the intelligence or safety of a model as a whole, b3 zooms in on the individual steps where LLMs may fail under targeted attacks, utilizing a method called threat snapshots. These snapshots isolate specific moments when an AI agent might make a vulnerable decision, allowing for a focused and reproducible evaluation of LLM security. The b3 benchmark employs nearly 200,000 human red-team attempts from the Gandalf: Agent Breaker project to create a comprehensive dataset for testing models against real-world adversarial scenarios. Findings reveal that models with explicit reasoning processes tend to be more secure and that open-weight models are rapidly closing the security gap with their closed-weight counterparts. The benchmark aims to transform AI security into a measurable and comparable science, offering valuable insights for developers, model providers, researchers, and policymakers, with the ultimate goal of establishing a new standard for evaluating AI agent security.