Agent evaluation engineering is an emerging discipline focused on assessing the performance of AI agents, which differ from traditional machine learning models in their non-deterministic behavior and complex decision chains. Whereas conventional ML evaluation scores static input-output pairs, agent evaluation examines the entire decision-making process, including tool selection, action sequencing, and error recovery, across multiple dimensions: end-to-end task success, step-level quality, and system-level performance.

The practice emphasizes continuous evaluation throughout the agent lifecycle, because agent behavior in production is hard to predict: real-world inputs surface edge cases, and failures can cascade across workflows. Effective evaluation frameworks therefore rely on well-defined metrics, context-sensitive datasets, and consistent monitoring across pre-production and production environments, with human feedback integrated to refine metrics over time. Because agents behave differently in live settings than in controlled tests, dedicated agent evaluation engineers play a crucial role in designing robust evaluation methodologies that keep autonomous systems reliable, safe, and cost-effective in real-world applications.
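To make the three dimensions concrete, the sketch below shows a minimal evaluation harness in Python. The `AgentTrace` record and the metric functions (`task_success_rate`, `tool_selection_accuracy`, `mean_cost`) are illustrative assumptions, not the API of any particular framework; they simply map each dimension onto a measurable quantity over a set of recorded agent runs.

```python
from dataclasses import dataclass

# Hypothetical record of one agent run; field names are illustrative assumptions.
@dataclass
class AgentTrace:
    task_id: str
    succeeded: bool              # end-to-end: did the agent complete the task?
    expected_tools: list[str]    # step-level: tools a reference solution would use
    used_tools: list[str]        # step-level: tools the agent actually called
    latency_s: float             # system-level: wall-clock time
    cost_usd: float              # system-level: token/API spend

def task_success_rate(traces: list[AgentTrace]) -> float:
    """End-to-end dimension: fraction of tasks completed successfully."""
    return sum(t.succeeded for t in traces) / len(traces)

def tool_selection_accuracy(traces: list[AgentTrace]) -> float:
    """Step-level dimension: overlap between expected and actual tool calls."""
    scores = []
    for t in traces:
        expected, used = set(t.expected_tools), set(t.used_tools)
        scores.append(len(expected & used) / len(expected) if expected else 1.0)
    return sum(scores) / len(scores)

def mean_cost(traces: list[AgentTrace]) -> float:
    """System-level dimension: average spend per task."""
    return sum(t.cost_usd for t in traces) / len(traces)

if __name__ == "__main__":
    # Toy traces standing in for runs collected from staging or production.
    traces = [
        AgentTrace("t1", True,  ["search", "summarize"],  ["search", "summarize"], 4.2, 0.031),
        AgentTrace("t2", False, ["search", "calculator"], ["search"],              9.8, 0.054),
    ]
    print(f"task success rate:       {task_success_rate(traces):.2f}")
    print(f"tool selection accuracy: {tool_selection_accuracy(traces):.2f}")
    print(f"mean cost (USD):         {mean_cost(traces):.3f}")
```

Run continuously, the same metrics can be computed over traces sampled from pre-production test suites and from live traffic, which is one way to detect the gap between controlled tests and production behavior described above.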