A new layer of infrastructure is emerging for AI that mirrors the development of CI/CD, observability, and DevOps in traditional software engineering, but is tailored to the probabilistic systems driven by large language models (LLMs). This infrastructure matters because AI products are becoming more deeply embedded in business workflows, shifting the focus from merely building AI to ensuring its reliability, performance, and iterative improvement.

Traditional software testing breaks down for AI: LLMs are non-deterministic and can return different outputs for the same input, and AI applications are complex enough to require evaluation frameworks that cover both final outputs and intermediate steps. Current ad-hoc approaches, such as manual review and custom in-house tooling, do not scale and slow the pace of development.

Demand for robust, scalable evaluation and observability is therefore rising. Platforms like Braintrust offer systematic frameworks to test, monitor, and improve AI agents, supporting reliability through features like Brainstore and Loop. As AI systems grow in complexity and reach production at scale, reliable, testable, and observable infrastructure becomes essential, positioning Braintrust as a pivotal player in the next generation of AI development, much as CI/CD was for traditional software.
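To make the evaluation pattern concrete, the sketch below shows one minimal way to handle non-deterministic outputs: run each test case several times and aggregate scores, rather than asserting a single exact result. This is only an illustrative harness, not Braintrust's actual API; names such as `EvalCase`, `run_eval`, and `toy_task` are hypothetical placeholders.

```python
# Minimal, illustrative eval harness for non-deterministic LLM outputs.
# NOT Braintrust's API; all names here are hypothetical placeholders
# used to sketch the dataset -> task -> scorer pattern.

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str      # prompt or task input
    expected: str   # reference answer used by the scorer


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output matches the reference (case-insensitive), else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    task: Callable[[str], str],
    dataset: list[EvalCase],
    scorer: Callable[[str, str], float],
    trials: int = 3,
) -> float:
    """Run each case multiple times to account for non-determinism,
    then report the mean score across all trials."""
    scores = []
    for case in dataset:
        for _ in range(trials):
            output = task(case.input)  # in practice, a call to an LLM
            scores.append(scorer(output, case.expected))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical stand-in for an LLM call; a real task would invoke a model API.
    def toy_task(prompt: str) -> str:
        return "Paris" if "France" in prompt else "unknown"

    dataset = [EvalCase("What is the capital of France?", "Paris")]
    print(f"mean score: {run_eval(toy_task, dataset, exact_match):.2f}")
```

Production platforms extend this basic loop with versioned datasets, LLM-based and human scorers, tracing of intermediate steps, and dashboards for tracking regressions over time, but the core idea of systematic, repeated scoring against a dataset is the same.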