Evaluating LLM applications poses challenges beyond deterministic tasks like SQL generation, especially in qualitative scenarios such as assessing a chatbot's empathy or summarizing complex transcripts. Breaking a complex task into manageable subtasks, such as intent classification, information retrieval, and response generation, makes rigorous evaluation far more tractable.

Two primary methods stand out. The first is using negative examples: cataloging known failure modes and testing against them so the system keeps improving. The second is an LLM-as-judge rubric that documents and tests the criteria a good response must meet, though the rubric must be constructed carefully to avoid drifting out of alignment with business needs. Dialogue-based evaluation also has its place: freezing a conversation at specific points lets individual turns be assessed in isolation.

Effective evaluation ultimately requires clearly articulating expectations and investing in explicit success criteria, which in turn improves LLM application performance. PromptLayer, a prompt management system, speeds up the development cycle by facilitating prompt iteration and evaluation.
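As a rough sketch of how these methods might fit together, the snippet below freezes a dialogue at a single turn, scores the final assistant message against an LLM-as-judge rubric, and checks it against a small catalog of negative examples. The `call_llm` helper, the rubric items, and the failure phrases are illustrative placeholders, not part of PromptLayer or any specific framework; swap in your own model client and criteria.

```python
import json

# Hypothetical helper -- wire this to your own model provider (OpenAI, Anthropic, etc.).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real model call")

# Rubric the judge scores against; each item should map to a business requirement.
RUBRIC = [
    "Acknowledges the customer's frustration (empathy)",
    "Answers the question asked, without inventing policy details",
    "Ends with a concrete next step",
]

# Conversation frozen at the point under evaluation; the last assistant turn
# is the candidate response being graded.
frozen_dialogue = [
    {"role": "user", "content": "My order arrived broken and support hasn't replied in a week."},
    {"role": "assistant", "content": "I'm sorry about the damaged order and the wait. "
                                     "I've escalated your ticket; you'll hear back within 24 hours."},
]

def judge_response(dialogue: list[dict], rubric: list[str]) -> dict:
    """Ask an LLM judge to grade the final assistant turn against each rubric item."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in dialogue)
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    prompt = (
        "You are grading the FINAL assistant message in this conversation.\n\n"
        f"Conversation:\n{transcript}\n\n"
        "For each criterion below, answer pass or fail with a one-line reason.\n"
        f"Criteria:\n{criteria}\n\n"
        'Respond as JSON: {"results": [{"criterion": 1, "verdict": "pass", "reason": "..."}]}'
    )
    return json.loads(call_llm(prompt))

# Negative examples: cataloged failure modes the response must never exhibit.
FAILURE_PHRASES = ["as an AI language model", "I cannot help with that"]

def check_failure_modes(response: str) -> list[str]:
    """Return any cataloged failure phrases found in the candidate response."""
    return [p for p in FAILURE_PHRASES if p.lower() in response.lower()]
```

In practice, each frozen dialogue plus its rubric verdicts and failure-mode hits becomes one row in an evaluation set, so regressions surface as soon as a prompt change flips a previously passing case.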