In this post, co-authored by Alex Suraci and Sam Alba, Dagger describes how it implemented support for orchestrating large language models (LLMs) to power AI agents in software development workflows. The work involves translating Dagger APIs into tools that AI agents can use inside sandboxed environments, along with Evals that continuously test how well models use those tools. The implementation faced the challenge of expressing unambiguous, model-agnostic APIs to LLMs, which led to iterative development and debugging with Dagger Cloud to track LLM behavior. Evals, which measure LLM performance against specific prompts, emerged as an essential tool: running multiple parallel attempts across models makes performance issues visible and reproducible enough to resolve. The post details the complexity of designing clear prompts and ergonomic tools, the role of SystemPrompts in stabilizing model behavior, and introduces the Dagger Evaluator module as a resource for building and running Evals effectively.
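
To make the Eval idea concrete, here is a minimal sketch of the pattern described above: run several parallel attempts of a prompt against multiple models and score the results. This is not the Dagger Evaluator API; the `Eval` interface, `runModel` stub, and `runEval` function are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Eval is a hypothetical interface: a named prompt plus a check
// that decides whether a single model attempt succeeded.
type Eval interface {
	Name() string
	Prompt() string
	Check(output string) bool
}

// attempt records the outcome of one run of an eval against one model.
type attempt struct {
	model   string
	success bool
}

// runModel stands in for a call into an LLM backend; it is a stub here.
func runModel(ctx context.Context, model, prompt string) (string, error) {
	return "", fmt.Errorf("runModel is a stub: wire this to your LLM backend")
}

// runEval runs n parallel attempts of an eval against each model and
// reports the per-model success rate, mirroring the "multiple parallel
// attempts across models" approach described in the post.
func runEval(ctx context.Context, e Eval, models []string, n int) map[string]float64 {
	var mu sync.Mutex
	var wg sync.WaitGroup
	results := map[string][]attempt{}

	for _, model := range models {
		for i := 0; i < n; i++ {
			wg.Add(1)
			go func(model string) {
				defer wg.Done()
				out, err := runModel(ctx, model, e.Prompt())
				ok := err == nil && e.Check(out)
				mu.Lock()
				results[model] = append(results[model], attempt{model: model, success: ok})
				mu.Unlock()
			}(model)
		}
	}
	wg.Wait()

	rates := map[string]float64{}
	for model, attempts := range results {
		passed := 0
		for _, a := range attempts {
			if a.success {
				passed++
			}
		}
		rates[model] = float64(passed) / float64(len(attempts))
	}
	return rates
}
```

The point of the sketch is the shape of the workflow, not the details: a single pass/fail judgment is too noisy to debug prompt or tool design, so aggregating many attempts per model into a success rate is what makes regressions and model-specific quirks stand out.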