Agent harnesses have an expiration date
Blog post from Arize
This post examines agent harness architecture: how to build a harness that survives multiple model generations, with a focus on the implicit finish assumption inside Claude Code's harness and what goes wrong when that assumption is applied to other models, such as OpenAI's GPT series.

The Claude Code harness treats a response containing no tool calls as a signal that the task is complete. Models like GPT-4o, which tend to separate narration from action, can emit a narration-only turn mid-task, so the same rule causes the agent loop to exit prematurely. To address this, the author at Arize built two alternatives: an Explicit Finish harness and an Adaptive Finish harness, with the latter proving more reliable at catching false finishes while staying efficient.

Drawing on tests across multiple models and tasks, the author argues for adaptable harness design backed by thorough evaluation, including eval suites that keep pace with model updates and task variations. The piece concludes that the hidden tunings inside a harness must be understood and made configurable, so that baked-in assumptions do not silently degrade performance across differing model behaviors.
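The post does not include the harness code, but the finish behaviors it describes can be sketched. The snippet below is a minimal, hypothetical illustration (the `Turn` type, the probe wording, and both loop functions are assumptions, not Arize's or Claude Code's implementation): an implicit-finish loop exits on the first turn with no tool calls, while an adaptive-finish loop treats that as only an *apparent* finish and sends one follow-up probe, resuming if the model responds with more tool calls. An explicit-finish variant would instead expose a dedicated finish tool that the model must call to end the loop.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One model turn: free text plus zero or more tool calls (hypothetical shape)."""
    text: str
    tool_calls: list = field(default_factory=list)

def run_tools(turn, history):
    # Stand-in for real tool execution: append a result message per call.
    for call in turn.tool_calls:
        history.append(Turn(text=f"tool result for {call}"))

def run_implicit(model, history):
    """Implicit finish: the first narration-only turn ends the loop."""
    while True:
        turn = model(history)
        history.append(turn)
        if not turn.tool_calls:   # no tool calls -> assume the task is done
            return history
        run_tools(turn, history)

def run_adaptive(model, history,
                 probe="Continue if work remains; otherwise confirm you are done."):
    """Adaptive finish: a narration-only turn triggers a probe; only a second
    consecutive actionless turn counts as a real finish."""
    while True:
        turn = model(history)
        history.append(turn)
        if turn.tool_calls:
            run_tools(turn, history)
            continue
        # Apparent finish: probe once before exiting.
        history.append(Turn(text=probe))
        followup = model(history)
        history.append(followup)
        if not followup.tool_calls:
            return history        # confirmed finish
        run_tools(followup, history)
```

With a GPT-style scripted model that narrates before acting, the implicit loop exits on the narration turn and the edit never runs, while the adaptive loop's probe recovers the lost action at the cost of one extra model call per apparent finish.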