Models got an order of magnitude better at following instructions in one year
Blog post from Arize
Over the past year, AI models have become dramatically better at following complex instructions, as measured by the updated IFScale benchmark. The benchmark, originally described by Jaroslawicz et al. (2025), tests how well a model can satisfy many simultaneous constraints, such as including specific keywords in a generated business report.

Where older models' accuracy degraded beyond 200-300 simultaneous instructions, current frontier models such as GPT 5.5 and Gemini 3.1 Pro maintain high accuracy with up to 5,000 instructions. For AI engineering, this permits far more detailed prompts and reduces the need for compressed skill files, though it introduces new trade-offs in cost and processing time.

Models also fail in distinct ways: some politely refuse overly complex tasks, while others overthink or misinterpret constraints. Despite these failure modes, the ability to manage thousands of instructions opens new possibilities for building sophisticated AI applications.
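To make the scoring concrete, here is a minimal sketch of how an IFScale-style keyword-inclusion check might work. The post does not show the benchmark's actual implementation; the function name, the example report, and the keyword list below are all illustrative assumptions.

```python
# Hypothetical sketch of an IFScale-style scorer: each instruction is a
# required keyword, and accuracy is the fraction of keywords present in
# the model's output. Names here are assumptions, not the real benchmark.

def ifscale_accuracy(response: str, keywords: list[str]) -> float:
    """Return the fraction of required keywords found in the response."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 1.0

# Illustrative example: 4 of the 5 required keywords appear.
report = "Q3 revenue synergy improved; cloud margin and churn both stabilized."
required = ["revenue", "synergy", "margin", "churn", "headcount"]
print(ifscale_accuracy(report, required))  # 0.8
```

In the real benchmark the constraint count scales into the thousands, which is where older models' accuracy collapsed and current frontier models now hold up.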