Your model upgrade just broke your agent's safety

Post Details

Company

Promptfoo

Date Published

Dec. 8, 2025

Author

Guangshuo Zang

Word Count

1,980

Language

English

Hacker News Points

-

Source URL

www.promptfoo.dev/blog/model-upgrades-break-agent-safety

Summary

Upgrading models, such as from GPT-4o to GPT-4.1, can unexpectedly alter their instruction-following and refusal behaviors, impacting safety and security measures like prompt-injection resistance. These changes necessitate treating upgrades as security adjustments rather than mere quality improvements, as they can affect both model-level safety and broader agent security. While model-level safety encompasses built-in behaviors like refusing harmful requests, agent security involves preventing misuse of tools, data exfiltration, and unauthorized system access. Different model families, like those from OpenAI, Anthropic, and Google, present unique safety challenges that require specific testing for dual-use prompts, multi-turn interactions, and tool-use scenarios. The text emphasizes the importance of defense-in-depth strategies, application-layer guardrails, and continuous testing to ensure secure model deployment, as new updates can alter the balance between helpfulness, safety, and instruction-following, potentially introducing security vulnerabilities.