Closing the verification loop: Observability-driven harnesses for building with agents
Blog post from Datadog
AI agents have drastically accelerated code generation, shifting the bottleneck of software development to verifying that the generated code is correct. Datadog has adopted a strategy it calls harness-first engineering, in which automated checks replace extensive human review to ensure code reliability. The approach combines deterministic simulation testing, formal specifications, and observability-driven feedback loops to verify AI-generated code.

Two projects illustrate its effectiveness. Redis-rust achieved an 87% memory reduction, and Helix significantly improved produce latency compared to a baseline Kafka cluster; both maintained correctness without traditional code reviews.

The harness-first methodology enables rapid iteration, with the human role focused on defining system ideas and strengthening verification, marking a shift in engineering toward designing checks rather than inspecting outputs. The integration of formal methods and automated pipelines inverts the traditional trade-off between scalability and rigor, letting AI agents handle tasks that previously required significant human oversight. Observability closes the loop: any discrepancy between modeled behavior and actual performance feeds back into the verification process over time, a significant step toward industrializing software engineering.
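The post does not include Datadog's harness code, but the core mechanism of deterministic simulation testing can be sketched: drive the implementation and a simple reference model with the same seeded pseudo-random operation stream, assert they agree at every step, and report the seed on divergence so any failure replays exactly. The `KVStore` class and `simulate` function below are illustrative stand-ins, not Datadog's actual harness.

```python
import random

class KVStore:
    """Implementation under test: a tiny key-value store (a stand-in
    for a real system like redis-rust)."""
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        return self._data.pop(key, None) is not None

def simulate(seed, ops=1000):
    """Run one deterministic simulation: the seed fully determines the
    operation stream, so a failing seed is a reproducible test case."""
    rng = random.Random(seed)
    store, model = KVStore(), {}  # model is the trusted reference
    for step in range(ops):
        key = f"k{rng.randrange(8)}"  # small key space forces collisions
        op = rng.choice(["set", "get", "delete"])
        if op == "set":
            value = rng.randrange(100)
            store.set(key, value)
            model[key] = value
        elif op == "get":
            got, want = store.get(key), model.get(key)
            assert got == want, f"seed={seed} step={step}: get({key}) -> {got}, want {want}"
        else:
            got, want = store.delete(key), key in model
            model.pop(key, None)
            assert got == want, f"seed={seed} step={step}: delete({key}) -> {got}, want {want}"
    return True

# Each seed is an independent, replayable check the agent's output must pass.
for seed in range(20):
    simulate(seed)
```

Because the harness is deterministic, the agent can iterate against it unattended: a regression surfaces as a specific seed and step rather than a flaky failure, which is what makes automated checks a credible substitute for line-by-line human review.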