
Turning Production Logs into Evaluation Datasets: A Data-Driven Approach

Blog post from Fireworks AI

Post Details
Company: Fireworks AI
Date Published: -
Author: -
Word Count: 1,351
Language: English
Hacker News Points: -
Summary

For teams running large language models (LLMs) in production, logs of real traffic are a valuable source of evaluation data, but they are hard to use directly: the raw data is unstructured and contains a high volume of redundant queries. A data-driven approach based on semantic clustering can turn these logs into manageable, high-quality evaluation datasets that reflect real user interactions. The process converts user queries into vector embeddings, reduces their dimensionality with UMAP, and applies HDBSCAN to cluster them automatically. Sampling from each cluster then yields a stratified sample that covers the diverse user intents found in the logs. Lilac, an open-source tool, supports this by letting teams visualize and refine their data, which makes it easier to build datasets that balance common queries with critical edge cases. Integrating Lilac with Eval Protocol operationalizes the workflow, so teams can efficiently generate evaluation datasets that assess LLM performance realistically, based on actual user traffic.
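
The final sampling step can be illustrated with a minimal sketch. It assumes the embedding, UMAP, and HDBSCAN steps have already run and produced one cluster label per query, with -1 marking points HDBSCAN left unassigned. The function name `stratified_sample`, the `per_cluster` parameter, and the example queries and labels are all hypothetical, chosen only for illustration; this is not the Lilac or Eval Protocol API.

```python
import random
from collections import defaultdict


def stratified_sample(queries, labels, per_cluster=2, seed=0):
    """Pick up to `per_cluster` queries from every cluster label.

    `labels[i]` is the cluster id assigned to `queries[i]`. Points labeled
    -1 (unassigned by the clustering step) are treated as their own stratum
    so that unusual queries are not silently dropped.
    """
    if len(queries) != len(labels):
        raise ValueError("queries and labels must have the same length")

    # Group queries by cluster id.
    clusters = defaultdict(list)
    for query, label in zip(queries, labels):
        clusters[label].append(query)

    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    sample = {}
    for label in sorted(clusters):
        members = clusters[label]
        k = min(per_cluster, len(members))  # small clusters contribute all they have
        sample[label] = rng.sample(members, k)
    return sample


# Hypothetical logged queries and cluster labels, for illustration only.
queries = [
    "reset my password",
    "forgot password, help",
    "how do I change my password",
    "cancel my subscription",
    "stop billing me",
    "translate this to Klingon",
]
labels = [0, 0, 0, 1, 1, -1]

eval_set = stratified_sample(queries, labels, per_cluster=2)
for label, picked in eval_set.items():
    print(label, picked)
```

Capping each cluster at the same size keeps the most frequent intent from dominating the dataset, while the lone unassigned query still appears, which is one way to balance common queries against edge cases.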