Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study
Blog post from Together AI
Large reasoning models (LRMs), which generate detailed reasoning traces before answering, are increasingly used for complex tasks, but there is growing concern about whether they follow user instructions throughout those traces. Together AI introduces ReasonIF, a benchmark designed to evaluate exactly this capability: whether LRMs adhere to instructions during reasoning, not just in their final responses.

The ReasonIF dataset pairs 300 math and science problems with specific instructions designed to test adherence, and measures compliance with an instruction-following score (IFS). The study finds that while LRMs often comply in their final output, they frequently violate the same instructions in intermediate reasoning steps: IFS during reasoning drops significantly compared to the main response. The shortfall is particularly evident for instructions requiring strict formats such as JSON or all-uppercase text, where some models show near-zero compliance.

The findings also show that adherence degrades as task difficulty increases: the harder the problem, the less reliably an LRM follows its instructions. This poses a challenge for the reliability of LRMs in real-world applications where nuanced guidance must hold throughout the reasoning trace, not only in the final answer.
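To make the scoring concrete, here is a minimal sketch of how a per-instruction IFS might be computed for two of the instruction types mentioned above (uppercase and JSON). The checker functions and the aggregation are illustrative assumptions for this post, not the actual ReasonIF evaluation code.

```python
import json

# Hypothetical checkers for two instruction types described above.
# These are illustrative assumptions, not the ReasonIF implementation.

def follows_uppercase(text: str) -> bool:
    """True if every alphabetic character in the text is uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def follows_json(text: str) -> bool:
    """True if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

CHECKERS = {"uppercase": follows_uppercase, "json": follows_json}

def instruction_following_score(samples) -> float:
    """Fraction of (instruction_type, text) pairs satisfying their checker.

    `text` is either a reasoning trace or a final response; scoring the
    two separately exposes the gap the study describes.
    """
    results = [CHECKERS[kind](text) for kind, text in samples]
    return sum(results) / len(results)

# Example: score reasoning traces and final responses separately.
traces = [("uppercase", "FIRST I FACTOR THE EQUATION..."),
          ("json", "Let me think step by step...")]   # not JSON -> fails
answers = [("uppercase", "THE ANSWER IS 42."),
           ("json", '{"answer": 42}')]

print("IFS (reasoning):", instruction_following_score(traces))   # 0.5
print("IFS (response):", instruction_following_score(answers))   # 1.0
```

Applying the same checker to both the reasoning trace and the final response, rather than to the response alone, is what surfaces the compliance gap the benchmark is built to measure.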