Turning Production Logs into Evaluation Datasets: A Data-Driven Approach

Post Details

Company

Fireworks AI

Date Published

Jan. 24, 2026

Author

-

Word Count

1,351

Language

English

Hacker News Points

-

Source URL

fireworks.ai/blog/Turning-Production-Logs-into-Evaluation-Datasets

Summary

For teams running large language models (LLMs) in production, turning production logs into evaluation datasets is critical but challenging, because raw logs are unstructured and contain a high volume of redundant queries. A data-driven approach based on semantic clustering can turn these logs into manageable, high-quality evaluation datasets that reflect real user interactions. The process converts user queries into vector embeddings, reduces their dimensionality with UMAP, and applies HDBSCAN to cluster them automatically. Sampling from each cluster then yields a stratified sample that captures diverse user intents and gives broad coverage of user queries. Lilac, an open-source tool, supports this by letting teams visualize and refine their data, which makes it easier to build datasets that balance common queries with critical edge cases. The integration of Lilac with Eval Protocol operationalizes this workflow, so teams can efficiently generate evaluation datasets that give realistic and insightful assessments of LLM performance based on actual user traffic.
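The final sampling step can be illustrated with a minimal sketch. It assumes the clustering stage has already assigned one integer label per query, with -1 marking points HDBSCAN treats as noise. The function name `stratified_sample`, the `per_cluster` parameter, and the toy queries are hypothetical and are not taken from Lilac or Eval Protocol. Keeping the noise points as their own stratum is also an assumption, made here so that unusual queries are not dropped.

```python
import random
from collections import defaultdict


def stratified_sample(queries, labels, per_cluster=2, seed=0):
    """Draw up to `per_cluster` queries from every cluster label.

    `labels[i]` is the cluster id of `queries[i]`. A label of -1 (noise,
    following HDBSCAN's convention) is kept as its own stratum here; that
    is a choice made for this sketch, not a documented behavior of any tool.
    """
    if len(queries) != len(labels):
        raise ValueError("queries and labels must have the same length")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group queries by their cluster label.
    by_cluster = defaultdict(list)
    for query, label in zip(queries, labels):
        by_cluster[label].append(query)

    # Sample without replacement inside each cluster, capped by cluster size.
    sample = {}
    for label in sorted(by_cluster):
        members = by_cluster[label]
        sample[label] = rng.sample(members, min(per_cluster, len(members)))
    return sample


if __name__ == "__main__":
    # Toy data standing in for logged user queries and their cluster labels.
    toy_queries = [
        "reset my password",
        "forgot password",
        "change password",
        "cancel my subscription",
        "translate this poem into Klingon",
    ]
    toy_labels = [0, 0, 0, 1, -1]
    for label, picked in stratified_sample(toy_queries, toy_labels).items():
        print(label, picked)
```

Capping every cluster at the same size keeps a large cluster of near-duplicate queries from dominating the dataset, while small clusters and outliers still contribute examples. A proportional allocation would be an equally reasonable alternative when the evaluation set should mirror traffic volume.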