LangChain's LangSmith platform introduces Test Run Comparisons, a feature aimed at improving the evaluation of large language model (LLM) applications by addressing the difficulty of quantitatively assessing changes to prompts, chains, or agents. The feature lets users manually inspect and compare multiple test runs over the same dataset, with an interface that displays inputs, reference outputs, actual outputs, and evaluation metrics. Users can apply filters to focus on the most significant differences between test runs, making it easier to see how a change affects behavior and to understand the LLM's performance on specific tasks. By supporting side-by-side comparison and deeper exploration of individual datapoints, LangSmith aims to provide the kind of infrastructure for manual data inspection that successful AI researchers and engineers rely on. LangSmith is currently in private beta, and the team invites feedback as it expands access and introduces more features.
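To make the comparison workflow concrete, here is a minimal sketch of how two test runs over the same dataset might be produced with the LangSmith Python SDK, so that the Test Run Comparisons view can then line them up side by side. This is an illustration under assumptions, not the post's own code: the dataset name, example content, evaluator, and the `app_v1`/`app_v2` target functions are all hypothetical stand-ins for two versions of an application.

```python
# Sketch: producing two test runs (experiments) on one LangSmith dataset,
# assuming the langsmith Python SDK. All names below are hypothetical.
from langsmith import Client, evaluate

client = Client()

# Hypothetical dataset of question -> reference-answer pairs.
dataset = client.create_dataset("qa-smoke-test")
client.create_examples(
    inputs=[{"question": "What does LangSmith help with?"}],
    outputs=[{"answer": "Tracing and evaluating LLM applications."}],
    dataset_id=dataset.id,
)

def exact_match(run, example) -> dict:
    # Toy evaluator: score 1 if the produced answer matches the reference exactly.
    predicted = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted.strip() == reference.strip())}

def app_v1(inputs: dict) -> dict:
    # Stand-in for the original prompt/chain/agent being evaluated.
    return {"answer": "Tracing and evaluating LLM applications."}

def app_v2(inputs: dict) -> dict:
    # Stand-in for a revised version whose effect you want to assess.
    return {"answer": "Observability for LLM apps."}

# Each call creates a separate test run against the same dataset; the
# comparison UI can then show their outputs and metrics side by side.
evaluate(app_v1, data="qa-smoke-test", evaluators=[exact_match], experiment_prefix="app-v1")
evaluate(app_v2, data="qa-smoke-test", evaluators=[exact_match], experiment_prefix="app-v2")
```

With two or more such runs recorded against one dataset, the comparison view described above is where filtering and per-datapoint inspection happen; the SDK's role here is simply to generate the runs being compared.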