Droid: The #1 Software Development Agent on Terminal-Bench

Post Details

Company

Factory

Date Published

Sept. 25, 2025

Author

Abhay Singhal, Leo Tchourakov, Daniel Flaherty, Stepan Bedratiuk

Word Count

2,103

Company Posts That Month

3

Language

English

Hacker News Points

-

Post removed?

No

Source URL

factory.ai/news/terminal-bench

Summary

Droid scored 58.75% on Terminal-Bench, the leading result on the benchmark, which the post credits to agent design rather than model choice. Terminal-Bench tests whether AI agents can complete complex tasks in a terminal environment, including coding, data workflows, and system tasks. Droid's model-agnostic design combines systematic environment exploration, hierarchical prompting, and minimalist tool design, which allows it to outperform expensive models and multi-model agents. Hierarchical prompts and model-specific architectures accommodate differences in model behavior, while the minimalist tool design reduces error rates and improves task completion. Droid also gathers system context, optimizes for speed, and maintains an organized task execution plan, which helps it adapt and execute tasks quickly. On challenging tasks it draws on the strengths of individual models: Opus 4.1 excels at security vulnerability tasks, whereas GPT-5 excels in domains like ML model training. Planned work includes exploring multi-agent architectures, improving memory and learning capabilities, and making Droid available across more interfaces and workflows.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM Change
LLM	6	3,636	538	190	-7%
AI Agents	3	2,405	487	169	-3%
Reinforcement learning	1	112	29	18	+14%
Secrets Management	1	1,019	166	73	-2%
