Evaluating DeepAgents CLI on Terminal Bench 2.0
Blog post from LangChain
DeepAgents CLI is an open-source, Python-based terminal coding agent built on the Deep Agents SDK and designed for tasks across diverse domains such as software engineering, biology, security, and gaming. Its features include shell command execution, file operations, web search, task planning, and persistent memory storage. To evaluate its performance, the agent was tested on Terminal Bench 2.0, a benchmark of 89 tasks, where it achieved a mean score of 42.65%, comparable to other implementations using the same model. The evaluation runs on Harbor, a framework that executes agents in containerized environments so that each run is sandboxed, isolated, and clean. DeepAgents Harbor supports scalable evaluation across multiple sandbox providers such as Docker and Daytona. The benchmark tasks range from simple to complex, and results are verified automatically and scored with a reward. The evaluation positions DeepAgents CLI as a competitive solution, and future work aims to improve performance through systematic analysis and optimization.
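To make the idea of a mean score over automatically verified tasks concrete, here is a minimal sketch of one way such a number could be aggregated. It is not Harbor's or Terminal Bench's actual code, and the post does not say how the 42.65% figure was computed. The sketch assumes each task is run for one or more trials, that each trial yields a reward between 0 and 1, and that the score is the average of per-task averages. The function name `mean_score` and the task names are hypothetical.

```python
from statistics import mean


def mean_score(rewards_by_task: dict[str, list[float]]) -> float:
    """Average each task's trial rewards, then average across tasks.

    Returns a percentage. This aggregation scheme is an assumption made
    for illustration, not the benchmark's documented method.
    """
    if not rewards_by_task:
        raise ValueError("no task results to score")
    per_task = []
    for name, rewards in rewards_by_task.items():
        if not rewards:
            raise ValueError(f"task {name!r} has no trials")
        per_task.append(mean(rewards))
    return 100.0 * mean(per_task)


# Hypothetical results: four made-up tasks with two trials each,
# where 1.0 means the verifier passed and 0.0 means it failed.
example = {
    "task-a": [1.0, 1.0],
    "task-b": [0.0, 1.0],
    "task-c": [0.0, 0.0],
    "task-d": [1.0, 0.0],
}

print(f"mean score: {mean_score(example):.2f}%")
```

Averaging per task first keeps every task equally weighted even if some tasks end up with more trials than others, which is one reasonable choice when comparing agents on a fixed task set.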