Evaluating model outputs in large language model (LLM) applications is challenging because human preferences are hard to encode as explicit rules, particularly for open-ended tasks like chat or writing. Pairwise evaluation, in which multiple candidate LLM answers are compared head-to-head to capture human preference, has emerged as a more effective approach. It is central to reinforcement learning from human feedback (RLHF) and to popular benchmarks like Chatbot Arena, where users or LLM judges select the better of two responses.

LangSmith now supports pairwise evaluation, letting users define custom evaluators that compare two LLM outputs against specified criteria and addressing the limitations of scoring each output in isolation. This is especially useful for tasks with no single correct answer, such as generating engaging Tweets from academic papers, where traditional criteria-based evaluations often fail to differentiate between models. LangSmith's UI then surfaces which LLM generations are preferred, providing a robust platform for experimentation and evaluation in LLM application development.
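
To make this concrete, here is a minimal sketch of what a custom pairwise evaluator for the Tweet-generation example could look like: an LLM judge picks the more engaging Tweet from two experiments' outputs. The `evaluate_comparative` entry point, the `(runs, example)` evaluator signature, and the dataset keys (`"paper"`, `"tweet"`) are assumptions about the LangSmith SDK rather than a verbatim excerpt of its API, so check the LangSmith docs for the exact interface.

```python
# Sketch of a pairwise "LLM-as-judge" evaluator for LangSmith.
# The LangSmith wiring (evaluate_comparative, the (runs, example) signature,
# and the dataset keys) is an assumption; the judging logic itself is generic.
from openai import OpenAI
from langsmith.evaluation import evaluate_comparative

client = OpenAI()

JUDGE_PROMPT = """You are comparing two Tweets generated from the same academic paper.

Paper summary:
{paper}

Tweet A:
{tweet_a}

Tweet B:
{tweet_b}

Which Tweet is more engaging and faithful to the paper? Answer with exactly "A" or "B"."""


def judge_pair(paper: str, tweet_a: str, tweet_b: str) -> int:
    """Return 0 if Tweet A is preferred, 1 if Tweet B is preferred."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model works here
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(paper=paper, tweet_a=tweet_a, tweet_b=tweet_b),
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 0 if verdict.startswith("A") else 1


def preferred_tweet(runs, example):
    """Comparative evaluator: score the preferred run 1 and the other 0."""
    winner = judge_pair(
        paper=example.inputs["paper"],        # hypothetical dataset input key
        tweet_a=runs[0].outputs["tweet"],     # hypothetical output key
        tweet_b=runs[1].outputs["tweet"],
    )
    return {
        "key": "preferred_tweet",
        "scores": {runs[i].id: int(i == winner) for i in range(2)},
    }


# Compare two previously run experiments by name or ID.
evaluate_comparative(
    ["tweet-generator-a", "tweet-generator-b"],
    evaluators=[preferred_tweet],
)
```

The resulting preference scores show up alongside each experiment in the LangSmith UI, so you can see at a glance which model's generations were preferred across the dataset.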