How we scored #1 on Terminal-Bench (52%)
Blog post from Warp
Terminal-Bench is an open-source benchmark that evaluates how well AI agents perform complex terminal-based tasks. Warp, a standalone application, achieved a state-of-the-art success rate of 52% on it. Each task places the agent in its own shell environment, where it must satisfy a task-specific test specification and validate its solution within a time limit. Warp's results varied with factors such as model choice and task planning.

Warp attributes its success to three things: a backend that supports rapid experimentation, a carefully configured model fallback chain, and the agent's ability to control long-running commands and maintain a todo list. The experiments used Claude Sonnet 4 and Claude Opus 4, with the fallback mechanism retrying a request when a call failed. Sonnet 4 remained the baseline because it performed slightly better.

Integrating with Terminal-Bench meant configuring environments and permissions so that the agent could act without interruption. Some test environments required cross-compilation, and running headless was found to be more reliable. The planning step proved crucial: it forced the agent to reason about the task at the outset, and the plan could still be adapted as the task progressed.
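To make the fallback mechanism more concrete, here is a minimal sketch of a model fallback chain with retries. It is an illustration under stated assumptions, not Warp's implementation: `call_with_fallback`, `ModelCallError`, `FallbackResult`, the `call_model` callback, the retry count, and the model name strings are all hypothetical, and the post does not describe the actual chain order or retry policy.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class ModelCallError(Exception):
    """Hypothetical error raised by a model client when a request fails."""


@dataclass
class FallbackResult:
    model: str     # name of the model that produced the reply
    reply: str     # text returned by that model
    attempts: int  # total calls made, including failed ones


def call_with_fallback(
    prompt: str,
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
    retries_per_model: int = 2,
) -> FallbackResult:
    """Try each model in order, retrying a few times before moving on."""
    attempts = 0
    last_error: Optional[Exception] = None
    for model in chain:
        for _ in range(retries_per_model):
            attempts += 1
            try:
                reply = call_model(model, prompt)
                return FallbackResult(model, reply, attempts)
            except ModelCallError as err:
                # Remember the failure and retry, or fall through to the next model.
                last_error = err
    raise RuntimeError(
        f"all models in the chain failed after {attempts} attempts"
    ) from last_error


if __name__ == "__main__":
    # Stub client (assumption): the first model always fails, the second succeeds.
    def stub_client(model: str, prompt: str) -> str:
        if model == "sonnet-4-placeholder":
            raise ModelCallError("simulated outage")
        return f"[{model}] plan for: {prompt}"

    result = call_with_fallback(
        "fix the failing build",
        ["sonnet-4-placeholder", "opus-4-placeholder"],  # illustrative names only
        stub_client,
    )
    print(result.model, result.attempts)
```

Passing the client in as a callback keeps the retry logic independent of any particular provider SDK, so the same chain can be exercised against a stub during experimentation.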