Models got an order of magnitude better at following instructions in one year
Blog post from Arize
Over the past year, AI models have become dramatically better at following complex instructions, as measured by the updated IFScale benchmark. The benchmark, originally described by Jaroslawicz et al. (2025), tests how well a model can satisfy many simultaneous constraints, such as including specific keywords in a generated business report.

Where older models' accuracy degraded beyond 200-300 simultaneous instructions, current frontier models such as GPT 5.5 and Gemini 3.1 Pro maintain high accuracy with up to 5,000 instructions. For AI engineering, this permits far more detailed prompts and reduces the need for compressed skill files, though it introduces new trade-offs in cost and processing time.

Models also fail in distinct ways: some politely refuse overly complex tasks, while others overthink or misinterpret constraints. Despite these failure modes, the ability to manage thousands of instructions opens new possibilities for building sophisticated AI applications.
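To make the scoring concrete, here is a minimal sketch of how an IFScale-style keyword-inclusion check might work. The post does not show the benchmark's actual implementation; the function name, the example report, and the keyword list below are all illustrative assumptions.

```python
# Hypothetical sketch of an IFScale-style scorer: each instruction is a
# required keyword, and accuracy is the fraction of keywords present in
# the model's output. Names here are assumptions, not the real benchmark.

def ifscale_accuracy(response: str, keywords: list[str]) -> float:
    """Return the fraction of required keywords found in the response."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 1.0

# Illustrative example: 4 of the 5 required keywords appear.
report = "Q3 revenue synergy improved; cloud margin and churn both stabilized."
required = ["revenue", "synergy", "margin", "churn", "headcount"]
print(ifscale_accuracy(report, required))  # 0.8
```

In the real benchmark the constraint count scales into the thousands, which is where older models' accuracy collapsed and current frontier models now hold up.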