Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study
Blog post from Together AI
Large reasoning models (LRMs), which generate detailed reasoning traces before answering, are increasingly used for complex tasks, but there is growing concern about whether they follow user instructions throughout those traces. Together AI introduces ReasonIF, a benchmark designed to evaluate exactly this capability: whether LRMs adhere to instructions during reasoning, not just in their final responses.

The ReasonIF dataset pairs 300 math and science problems with specific instructions designed to test adherence, and measures compliance with an instruction-following score (IFS). The study finds that while LRMs often comply in their final output, they frequently violate the same instructions in intermediate reasoning steps: IFS during reasoning drops significantly compared to the main response. The shortfall is particularly evident for instructions requiring strict formats such as JSON or all-uppercase text, where some models show near-zero compliance.

The findings also show that adherence degrades as task difficulty increases: the harder the problem, the less reliably an LRM follows its instructions. This poses a challenge for the reliability of LRMs in real-world applications where nuanced guidance must hold throughout the reasoning trace, not only in the final answer.
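To make the scoring concrete, here is a minimal sketch of how a per-instruction IFS might be computed for two of the instruction types mentioned above (uppercase and JSON). The checker functions and the aggregation are illustrative assumptions for this post, not the actual ReasonIF evaluation code.

```python
import json

# Hypothetical checkers for two instruction types described above.
# These are illustrative assumptions, not the ReasonIF implementation.

def follows_uppercase(text: str) -> bool:
    """True if every alphabetic character in the text is uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def follows_json(text: str) -> bool:
    """True if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

CHECKERS = {"uppercase": follows_uppercase, "json": follows_json}

def instruction_following_score(samples) -> float:
    """Fraction of (instruction_type, text) pairs satisfying their checker.

    `text` is either a reasoning trace or a final response; scoring the
    two separately exposes the gap the study describes.
    """
    results = [CHECKERS[kind](text) for kind, text in samples]
    return sum(results) / len(results)

# Example: score reasoning traces and final responses separately.
traces = [("uppercase", "FIRST I FACTOR THE EQUATION..."),
          ("json", "Let me think step by step...")]   # not JSON -> fails
answers = [("uppercase", "THE ANSWER IS 42."),
           ("json", '{"answer": 42}')]

print("IFS (reasoning):", instruction_following_score(traces))   # 0.5
print("IFS (response):", instruction_following_score(answers))   # 1.0
```

Applying the same checker to both the reasoning trace and the final response, rather than to the response alone, is what surfaces the compliance gap the benchmark is built to measure.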