
Safety Evals Should Project Test-Time Compute

Blog post from HuggingFace

Post Details

Company: HuggingFace
Date Published: -
Author: Tommaso Cerruti
Word Count: 2,521
Language: -
Hacker News Points: -
Summary

Safety evaluations of AI models should account for the impact of test-time compute: a model that seems safe under limited evaluation conditions may become unsafe once adversaries apply larger, adaptive, and economically rational amounts of compute. The conventional approach of assessing whether a model can perform dangerous actions under fixed test conditions is inadequate for modern AI systems, where adversaries can invest substantial inference-time effort, such as generating numerous prompt variants, using other models to refine attacks, or allocating compute adaptively. This shift calls for evaluations that cover the broader risk surface, including the model's behavior under varying budgets, attacker strategies, and deployment configurations. The economic rationale for adversaries further complicates the landscape, since a large potential payoff can justify high expenditure on attacks. Static safety checks remain useful but are insufficient for systems capable of longer reasoning, adaptive search, and tool use. Safety evaluations should therefore incorporate test-time compute into the threat model, report risk as a function of adversarial effort, and label safety claims with the conditions under which they hold.
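
To make the scaling argument concrete, here is a minimal sketch (an illustration, not code from the post): if a static eval measures a small per-attempt attack success rate, the chance that an adversary making many independent attempts succeeds at least once grows quickly with their inference-time budget. The per-attempt rate `p` and the budgets shown below are hypothetical.

```python
# Minimal sketch (hypothetical numbers, not from the post): how a small
# per-attempt jailbreak success rate compounds when an adversary can afford
# many independent inference-time attempts (e.g. best-of-n prompt variants).

def success_at_budget(p_single: float, n_attempts: int) -> float:
    """P(at least one of n independent attempts succeeds)."""
    return 1.0 - (1.0 - p_single) ** n_attempts

if __name__ == "__main__":
    p = 0.002  # hypothetical per-attempt success rate from a static eval
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"budget={n:>6} attempts -> attack success ~ {success_at_budget(p, n):.3f}")
```

If anything, the independence assumption understates the effect the post describes, since adaptive search and tool use let attackers reuse information across attempts rather than sampling blindly.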