Why Attack Success Rate (ASR) Isn't Comparable Across Jailbreak Papers Without a Shared Threat Model
Blog post from Promptfoo
Attack Success Rate (ASR) is the headline metric in jailbreak research, but it is rarely comparable across papers because there is no shared threat model behind it. ASR depends on the number of attempts allowed per target, the selection and difficulty of the test prompts, and the model used to judge whether an output counts as a successful jailbreak. Different research groups set these parameters differently, so the same attack method can produce very different ASR numbers depending purely on how it is measured.

The attempt budget alone can transform the result: an attack that succeeds on 1% of individual attempts can be reported as 98% successful when measured over many tries per target. The headline number says less about the attack than about the measurement protocol, which is why the context of measurement matters more than the ASR figure itself. A systematic study at NeurIPS 2025 makes this point directly: the reported differences between attack methods are often driven by these measurement choices rather than by the inherent quality of the attacks.

Prompt selection and judge choice add further distortion. Easy or ambiguous prompts inflate ASR, and judge models introduce their own biases and error rates, so two papers can score the same transcripts differently. Researchers therefore need to be transparent about the full threat model, including the attempt budget, prompt set, and judge, for results to be reproducible and comparable. Automated red teaming compounds the problem: automation choices themselves shape the measurement and need the same careful scrutiny to keep results valid.
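The arithmetic behind the 1%-to-98% example is easy to sketch. Assuming attempts against a target are independent (a simplification that real attacks may violate), the ASR at a budget of k attempts is 1 - (1 - p)^k, where p is the per-attempt success rate:

```python
def asr_at_k(per_attempt_asr: float, k: int) -> float:
    """Probability of at least one success in k independent attempts.

    Assumes independence across attempts, which real jailbreak
    attempts against the same target may not satisfy.
    """
    return 1.0 - (1.0 - per_attempt_asr) ** k

# An attack with a 1% per-attempt success rate at different budgets:
p = 0.01
for k in (1, 10, 100, 389):
    print(f"ASR@{k}: {asr_at_k(p, k):.1%}")
# ASR@1: 1.0%, ASR@10: 9.6%, ASR@100: 63.4%, ASR@389: 98.0%
```

The same per-attempt rate yields anywhere from 1% to 98% depending only on the budget, which is why ASR figures are meaningless without the attempt count attached.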