How to improve your golden datasets with human review

Blog post from Braintrust

Post Details
Word Count: 1,516
Language: English
Summary

Incorporating human expertise into the evaluation workflow of AI products establishes an authoritative benchmark against which outputs can be compared, so that product quality does not regress. Human review enters the process by turning production traces into "golden datasets" that evolve over time, which also helps refine scorers as the data changes. Traces are categorized by patterns such as failure mode and sentiment, and tools like Topics cluster them automatically into named categories. Reviewers apply their expertise to confirm the correct output for each trace, and that output is recorded as the "expected" value that guides the evaluation. Setting up human review requires defining a clear rubric and using review queues to route traces to subject matter experts, so that real-world expertise is applied rigorously. Over time, human-reviewed ground truth is used to build scalable, automated evaluation systems, and human review shifts from being the primary evaluation to supplying high-quality training signals. Anti-patterns to avoid include leaving "expected" values blank and mixing additional information into them, both of which undermine the evaluation process.
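
The two anti-patterns named above can be illustrated with a minimal sketch in plain Python. It does not use the Braintrust SDK: the `GoldenRecord` type, its field names, the `NOTE_MARKERS` list, and the `lint_record` helper are all hypothetical, and the check for mixed-in commentary is only a crude keyword heuristic, not a method described in the post.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical markers suggesting reviewer commentary leaked into "expected".
NOTE_MARKERS = ("note:", "reviewer:", "todo:")


@dataclass
class GoldenRecord:
    """One human-reviewed trace. All field names are assumptions."""
    input: str
    output: str                   # what the model produced in production
    expected: Optional[str]       # reviewer-confirmed correct output
    metadata: dict = field(default_factory=dict)  # notes, topic, failure mode


def lint_record(record: GoldenRecord) -> List[str]:
    """Return a list of anti-pattern warnings for one record."""
    expected = (record.expected or "").strip()
    if not expected:
        # Anti-pattern 1: a blank expected value gives scorers nothing to compare.
        return ["expected is blank"]
    problems = []
    if any(marker in expected.lower() for marker in NOTE_MARKERS):
        # Anti-pattern 2: commentary belongs in metadata, not in expected.
        problems.append("expected contains reviewer commentary")
    return problems


records = [
    GoldenRecord("Capital of France?", "Paris", "Paris"),
    GoldenRecord("Capital of Spain?", "Madrid", ""),
    GoldenRecord("Capital of Italy?", "Rome", "Rome. Note: model was slow here"),
]
for r in records:
    print(r.input, "->", lint_record(r) or "ok")
```

Under these assumptions, reviewer notes such as latency observations would go in `metadata`, leaving `expected` as only the correct answer.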