Writing my first evals
Blog post from WorkOS
The author recounts their journey of developing evaluation systems for two AI-powered developer tools at WorkOS, focusing on the challenge of determining whether the tools genuinely enhance developer experiences.

The first tool, the WorkOS CLI, uses the Claude Agent SDK to automatically install WorkOS AuthKit across various frameworks, but its outputs vary widely, complicating testing. To address this, the author created an evaluation system using fixture projects to establish baseline states for comparison post-installation, grading the outputs on both functional and quality metrics.

The second tool generates structured context documents for WorkOS features to enhance AI-driven developer assistance. Here, the evaluation involves A/B testing to see whether the context improves LLM outputs, revealing that some contexts inadvertently degrade performance.

Both systems emphasize statistical measurement rather than deterministic testing, with a focus on tracking trends, understanding nuances through saved transcripts, and calibrating evaluations against real-world scenarios, ultimately leading to a more data-driven approach to shipping AI tools.
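The fixture-based grading described above can be sketched roughly as follows. All names here (`Fixture`, `GradedRun`, `grade_run`, the file paths) are illustrative assumptions, not WorkOS's actual harness: the idea is simply that a functional grade is pass/fail against the fixture's expected state, while the quality grade averages softer checks.

```python
from dataclasses import dataclass

@dataclass
class Fixture:
    name: str                 # e.g. a hypothetical "nextjs-app-router" fixture
    required_files: set[str]  # files the install must create or modify

@dataclass
class GradedRun:
    fixture: str
    functional: bool   # did the install produce the required baseline diff?
    quality: float     # 0..1 fraction of softer checks passed (style, idioms)

def grade_run(fixture: Fixture, produced_files: set[str],
              quality_checks: list[bool]) -> GradedRun:
    """Grade one agent run: functional is pass/fail, quality is an average."""
    functional = fixture.required_files <= produced_files
    quality = sum(quality_checks) / len(quality_checks) if quality_checks else 0.0
    return GradedRun(fixture.name, functional, quality)

# Illustrative run: the agent created the callback route but skipped middleware.
fx = Fixture("nextjs-app-router", {"app/callback/route.ts", "middleware.ts"})
run = grade_run(fx, {"app/callback/route.ts"}, [True, True, False])
print(run.functional, round(run.quality, 3))  # False 0.667
```

Separating the hard functional gate from the averaged quality score lets a run fail installation outright while still recording how close its output came, which suits trend tracking better than a single pass/fail bit.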
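The A/B comparison for context documents might look like the sketch below, under the assumption that each prompt is run once with and once without the context and graded pass/fail. The function names and the canned results are hypothetical (a real harness would call an LLM); the point is that the measured delta can come out negative, flagging a context that degrades performance.

```python
from statistics import mean

def pass_rate(results: list[bool]) -> float:
    """Fraction of graded runs that passed."""
    return mean(results)

def ab_delta(baseline: list[bool], with_context: list[bool]) -> float:
    """Positive delta means the context helped; negative means it hurt."""
    return pass_rate(with_context) - pass_rate(baseline)

# Canned example: on these prompts the context actually hurts (6/8 -> 4/8 pass).
baseline     = [True, True, False, True, True, False, True, True]
with_context = [True, False, False, True, False, False, True, True]
print(f"delta: {ab_delta(baseline, with_context):+.3f}")  # delta: -0.250
```

Because individual runs are noisy, a delta like this is only meaningful across many prompts and repeated runs, which matches the post's emphasis on statistical measurement and trend tracking over deterministic tests.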