When developing AI systems that interact with the real world, standard metrics such as accuracy, latency, token usage, and safety are insufficient on their own, particularly in complex, context-heavy domains where success is subjective and context-dependent. Instead, teams need custom metrics tailored to the domain, defined and evaluated with subject matter experts (SMEs) so that the metrics align with domain-specific expectations and workflows. Operationalizing these evaluations means starting with clear, binary scoring for consistency, involving SMEs early in the design and evaluation phases, and gradually scaling the evaluation system, automating parts of it as the metrics mature (see the sketch below). The goal is to prioritize user-centered metrics over easy-to-track ones, keep evaluation decisions actionable, and ensure that domain-specific applications deliver real value, rather than relying on generic metrics that miss the nuanced performance specialized fields require.
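As a rough illustration of the "start with binary scoring" step, the sketch below shows one way an SME-authored rubric of pass/fail criteria could be applied to model outputs and rolled up into per-criterion pass rates. The criterion names, helper functions, and clinical-notes example are hypothetical assumptions for illustration, not part of any particular evaluation framework.

```python
from dataclasses import dataclass
from typing import Callable, List

# A single SME-authored criterion: a name plus a pass/fail check.
# Binary scoring keeps grading consistent across reviewers and runs.
@dataclass
class Criterion:
    name: str
    check: Callable[[str, str], bool]  # (prompt, model_output) -> pass/fail

@dataclass
class EvalResult:
    criterion: str
    passed: bool

def run_eval(cases: List[tuple], criteria: List[Criterion]) -> List[EvalResult]:
    """Apply every SME-defined criterion to every (prompt, output) pair."""
    results = []
    for prompt, output in cases:
        for criterion in criteria:
            results.append(EvalResult(criterion.name, criterion.check(prompt, output)))
    return results

def pass_rate(results: List[EvalResult], criterion: str) -> float:
    """Per-criterion pass rate; each failure maps back to one rubric item."""
    relevant = [r for r in results if r.criterion == criterion]
    return sum(r.passed for r in relevant) / len(relevant) if relevant else 0.0

# Hypothetical domain-specific, binary criteria a clinical-notes SME might define.
criteria = [
    Criterion("cites_patient_record", lambda p, o: "per chart" in o.lower()),
    Criterion("no_dosage_guess", lambda p, o: "approximately" not in o.lower()),
]

cases = [("Summarize the visit.", "Per chart, BP stable; med list unchanged.")]
results = run_eval(cases, criteria)
for c in criteria:
    print(c.name, f"{pass_rate(results, c.name):.0%}")
```

Keeping each criterion binary and reporting pass rates per criterion makes disagreements easy to adjudicate with SMEs and keeps the output actionable: a drop in one rubric item points directly at what to fix. As the rubric stabilizes, individual lambda checks can be replaced with automated graders without changing the surrounding harness.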