Home / Companies / Promptfoo / Blog / Post Details
Content Deep Dive

Your model upgrade just broke your agent's safety

Blog post from Promptfoo

Post Details
Company
Date Published
Author
Guangshuo Zang
Word Count
1,980
Language
English
Hacker News Points
-
Summary

Upgrading models, such as from GPT-4o to GPT-4.1, can unexpectedly alter their instruction-following and refusal behaviors, impacting safety and security measures like prompt-injection resistance. These changes necessitate treating upgrades as security adjustments rather than mere quality improvements, as they can affect both model-level safety and broader agent security. While model-level safety encompasses built-in behaviors like refusing harmful requests, agent security involves preventing misuse of tools, data exfiltration, and unauthorized system access. Different model families, like those from OpenAI, Anthropic, and Google, present unique safety challenges that require specific testing for dual-use prompts, multi-turn interactions, and tool-use scenarios. The text emphasizes the importance of defense-in-depth strategies, application-layer guardrails, and continuous testing to ensure secure model deployment, as new updates can alter the balance between helpfulness, safety, and instruction-following, potentially introducing security vulnerabilities.