Droid: The #1 Software Development Agent on Terminal-Bench
Blog post from Factory
Droid has achieved a leading score of 58.75% on Terminal-Bench, a benchmark that tests how well AI agents complete complex tasks in a terminal environment, including coding, data workflows, and system tasks. The result sets a new standard for terminal agents and suggests that agent design matters more than model choice.

Droid's success is attributed to a model-agnostic design that combines systematic environment exploration, hierarchical prompting, and minimalist tool design, which allows it to outperform both expensive models and multi-model agents. Hierarchical prompts and model-specific architectures accommodate the differing behaviors of individual models, while the minimalist tool design reduces error rates and improves task completion. Droid adapts and executes tasks quickly by understanding the system context, optimizing for speed, and maintaining an organized task execution plan.

On challenging tasks, Droid draws on the strengths of the underlying model: Opus 4.1 excels at security vulnerability tasks, whereas GPT-5 excels in domains such as ML model training. Future work for Droid includes exploring multi-agent architectures, improving memory and learning capabilities, and making Droid available across more interfaces and workflows.
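To make the ideas of environment exploration and hierarchical prompting more concrete, here is a minimal Python sketch. It is not Factory's implementation: the function names `explore_environment` and `build_prompt`, the section headings, and the list of probed tools are all hypothetical and chosen only for illustration. The sketch gathers a small read-only snapshot of the working directory, then layers a prompt so that global rules come first, environment context second, and the task last.

```python
import platform
import shutil
from pathlib import Path


def explore_environment(root=".", tools=("git", "python3", "make")):
    """Collect a small, read-only snapshot of the working environment.

    Hypothetical helper: the fields gathered here are an assumption about
    what an agent might want to know before acting.
    """
    root_path = Path(root).resolve()
    # Cap the listing so the prompt stays short in large directories.
    entries = sorted(p.name for p in root_path.iterdir())[:20]
    return {
        "cwd": str(root_path),
        "os": platform.system(),
        # shutil.which returns None when a command is not on PATH.
        "tools": {name: shutil.which(name) is not None for name in tools},
        "entries": entries,
    }


def build_prompt(system_rules, task, env):
    """Layer the prompt: global rules, then environment, then the task."""
    available = [name for name, found in env["tools"].items() if found]
    sections = [
        "## Rules\n" + "\n".join(f"- {rule}" for rule in system_rules),
        "## Environment\n"
        f"cwd: {env['cwd']}\n"
        f"os: {env['os']}\n"
        f"available tools: {', '.join(available) or 'none detected'}\n"
        f"top-level entries: {', '.join(env['entries']) or 'none'}",
        "## Task\n" + task,
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    snapshot = explore_environment(".")
    print(build_prompt(["Plan before acting", "Prefer fast commands"],
                       "Fix the failing build", snapshot))
```

Keeping the exploration step separate from prompt assembly means the same snapshot can be reused across models, with only the rules layer changing per model. That is one plausible reading of a model-agnostic design, not a description of how Droid is built.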