
Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study

Blog post from Together AI

Post Details
Company: Together AI
Date Published
Author: Yongchan Kwon, Shang Zhu, Federico Bianchi, Kaitlyn Zhou, James Zou
Word Count: 2,221
Language: English
Hacker News Points: -
Summary

Large reasoning models (LRMs) generate detailed reasoning traces and are increasingly used for complex tasks, but there is growing concern about whether they follow user instructions throughout those traces. Together AI introduces ReasonIF, a benchmark that evaluates whether LRMs adhere to instructions during reasoning, not just in their final responses. The dataset comprises 300 math and science problems, each paired with a specific instruction to test adherence (for example, responding in JSON or in uppercase text). The study finds that while LRMs often comply in their final output, they frequently violate instructions in intermediate reasoning steps, with instruction-following scores (IFS) dropping sharply during reasoning compared to main responses. The shortfall is most pronounced for strict formatting instructions such as JSON or uppercase text, where some models show near-zero compliance, and it worsens as task difficulty increases, posing reliability challenges for real-world applications where nuanced guidance is essential.
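To make the evaluation idea concrete, here is a minimal sketch of how an instruction-following score could be computed separately for reasoning traces and final responses. The checker functions (`follows_uppercase`, `follows_json`) and the scoring formula are illustrative assumptions, not the actual ReasonIF implementation, which may use different instructions and aggregation.

```python
import json

def follows_uppercase(text: str) -> bool:
    """True if every alphabetic character in the text is uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def follows_json(text: str) -> bool:
    """True if the text parses as a single JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

def instruction_following_score(outputs, checker) -> float:
    """Fraction of outputs that satisfy the instruction checker."""
    if not outputs:
        return 0.0
    return sum(checker(o) for o in outputs) / len(outputs)

# Toy example mirroring the paper's finding: final answers comply
# with an "uppercase only" instruction, but reasoning traces drift.
final_answers = ["THE ANSWER IS 42.", "X EQUALS 7."]
reasoning_traces = ["First, note that 6*7 = 42...", "LET X BE THE UNKNOWN."]

ifs_final = instruction_following_score(final_answers, follows_uppercase)
ifs_trace = instruction_following_score(reasoning_traces, follows_uppercase)
```

Scoring traces and final answers with the same checker makes the gap the benchmark measures directly visible: here `ifs_final` is 1.0 while `ifs_trace` is only 0.5.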