Models got an order of magnitude better at following instructions in one year

Blog post from Arize

Post Details
Company: Arize
Date Published:
Author: Laurie Voss
Word Count: 2,175
Language: English
Hacker News Points: -
Summary

Over the past year, AI models have become far better at following many instructions at once, as shown by the updated IFScale benchmark. The benchmark, originally described by Jaroslawicz et al. (2025), measures how well a model adheres to a large number of simultaneous constraints, such as including specific keywords in a business report. Older models lost accuracy beyond 200-300 simultaneous instructions; current frontier models such as GPT 5.5 and Gemini 3.1 Pro can handle up to 5,000 with high accuracy, roughly an order of magnitude more. For AI engineering, this allows more detailed prompts and reduces the need for compressed skill files, but it raises new considerations around cost and processing time. Failure modes differ by model: some politely refuse complex tasks, while others overthink or misinterpret constraints. Despite these challenges, the ability to manage thousands of instructions opens new possibilities for building sophisticated AI applications.
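
To make the keyword-constraint idea concrete, here is a minimal sketch of how adherence to a list of required keywords might be scored. It is an illustration only, not the IFScale scoring code: the function name `keyword_adherence`, the whole-word, case-insensitive matching rule, and the sample report are all assumptions introduced here.

```python
import re


def keyword_adherence(report: str, keywords: list[str]) -> float:
    """Return the fraction of required keywords found in the report.

    Hypothetical helper: matching is whole-word and case-insensitive,
    which is an assumption and may differ from the benchmark's rule.
    """
    if not keywords:
        return 1.0  # no constraints means nothing was violated
    text = report.lower()
    hits = sum(
        1
        for kw in keywords
        if re.search(rf"\b{re.escape(kw.lower())}\b", text)
    )
    return hits / len(keywords)


# Illustrative data, not taken from the benchmark.
report = "Quarterly revenue grew while logistics costs fell."
required = ["revenue", "logistics", "compliance", "forecast"]
print(f"{keyword_adherence(report, required):.2f}")  # prints 0.50
```

With thousands of keywords the same check still runs in a single pass per keyword, so the scoring itself stays cheap; the cost and latency concerns in the summary apply to the model call, not to this kind of verification.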