Dropbox, the cloud storage and collaboration platform, built Dropbox Dash, an AI-powered tool for universal search and organization across connected applications. The project underscored that, in the foundation-model era, evaluating AI systems matters as much as training them: the team shifted from ad-hoc testing to a structured evaluation framework that treats experiments with the same rigor as production code.

That framework started with curated datasets that mirror real-world complexity, combining public sources such as Google's Natural Questions with internal datasets drawn from Dropbox employee usage. Moving beyond traditional metrics like BLEU and ROUGE, the team used large language models (LLMs) as judges to grade answers on factual correctness, citation accuracy, and formatting (see the first sketch below). They adopted Braintrust as their evaluation platform to manage datasets and experiments, making runs reproducible and regressions traceable through defined metrics and automated checks (see the second sketch below).

By automating evaluation across the development-to-production pipeline, Dropbox reduced the risk of shipping regressions and closed the loop on continuous improvement, mining low-scoring outputs to seed the next dataset iteration. The broader lessons: version your datasets, calibrate your model judges, and treat prompt changes as code changes. These practices turned Dropbox's AI development process into one that produces reliable, trustworthy products.
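To make the LLM-as-judge idea concrete, here is a minimal sketch of a judge that grades an answer on the three dimensions named above. It assumes an OpenAI-style chat-completions client; the `judge_answer` helper, rubric wording, and model name are illustrative placeholders, not Dropbox's actual implementation.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completions API works

client = OpenAI()

# Illustrative rubric covering the dimensions described above:
# factual correctness, citation accuracy, and formatting.
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Rate each dimension from 1 (poor) to 5 (excellent) and reply with JSON:
{{"factual_correctness": ..., "citation_accuracy": ..., "formatting": ..., "rationale": "..."}}"""


def judge_answer(question: str, reference: str, candidate: str) -> dict:
    """Ask an LLM judge to grade one candidate answer against a reference."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,  # deterministic grading aids reproducibility
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Pinning the temperature to zero and forcing JSON output keeps scores machine-parseable and comparable across runs, which matters once the judges themselves need to be calibrated and versioned.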
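And here is a sketch of wiring such scorers into an evaluation harness. It follows the pattern of Braintrust's public Python quickstart (`Eval` plus a scorer from its `autoevals` library); the project name, dataset rows, and task function are hypothetical stand-ins for Dash's real pipeline, and running it requires a Braintrust API key.

```python
from braintrust import Eval  # Braintrust's Python eval entry point
from autoevals import Factuality  # an LLM-based factuality scorer from autoevals


def answer_question(question: str) -> str:
    # Placeholder: the real task would call the Dash retrieval + generation pipeline.
    return "The Q3 roadmap is in the Planning folder."


# Each run is recorded as an experiment, so a CI job can execute this on every
# prompt or model change and diff the scores against the previous baseline.
Eval(
    "dash-qa",  # hypothetical project name
    data=lambda: [
        {
            "input": "Where is the Q3 roadmap doc?",
            "expected": "The Q3 roadmap lives in the Planning folder.",
        },
        # ...more rows curated from Natural Questions and internal usage
    ],
    task=answer_question,
    scores=[Factuality],
)
```

Gating merges on these scores, failing the build when a metric drops below the previous experiment's baseline, is one way to implement the automated regression checks described above.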