The AI Evals for Engineers & PMs course, led by Hamel Husain and Shreya Shankar, offers a comprehensive framework for evaluating and improving large language model (LLM) applications, exemplified by the Recipe Bot Workflow. This hands-on course integrates open-source tools such as Arize Phoenix and walks through a systematic five-step evaluation process: prompt design, synthetic data generation and error analysis, LLM-as-a-judge evaluators, retrieval evaluation for retrieval-augmented generation (RAG), and state-level diagnostics. Each step maps to concrete tasks: designing and iterating on prompts, using synthetic data to surface failure modes, employing an LLM to judge errors automatically, and analyzing retrieval quality and intermediate pipeline states. Phoenix underpins the workflow by logging and tracing runs and managing experiments, so participants can track progress and make data-driven improvements. This structured approach emphasizes reproducibility and scalability, replacing isolated, ad-hoc debugging with a workflow that adapts as system complexity grows.
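To make the LLM-as-a-judge step concrete, here is a minimal sketch of a binary pass/fail judge for a single Recipe Bot criterion. It is illustrative only, not the course's implementation: the prompt wording, the `judge_dietary_adherence` helper, the "dietary adherence" criterion, and the `gpt-4o-mini` judge model are all assumptions, and in the course the resulting verdicts would be logged to Phoenix for comparison across experiments.

```python
# Hypothetical sketch of an LLM-as-a-judge evaluator for one failure mode.
# Assumes OPENAI_API_KEY is set in the environment; all names here are illustrative.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a Recipe Bot response.
User request:
{request}

Bot response:
{response}

Question: Does the response respect every dietary restriction stated in the request?
Answer with a JSON object: {{"verdict": "pass" or "fail", "reason": "<one sentence>"}}"""


def judge_dietary_adherence(request: str, response: str) -> dict:
    """Ask a judge model for a pass/fail verdict plus a short rationale."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",          # assumed judge model; swap for your own
        temperature=0,                # deterministic grading makes runs comparable
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(request=request, response=response),
        }],
    )
    return json.loads(completion.choices[0].message.content)


if __name__ == "__main__":
    verdict = judge_dietary_adherence(
        request="I need a vegan dinner ready in under 30 minutes.",
        response="Try this creamy chicken alfredo with parmesan...",
    )
    print(verdict)  # e.g. {"verdict": "fail", "reason": "The recipe contains chicken and dairy."}
```

In practice a judge like this is itself validated against a small set of human-labeled examples before its verdicts are trusted at scale, which is why the course pairs this step with error analysis on synthetic data.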