Writing my first evals
Blog post from WorkOS
The author recounts their journey of developing evaluation systems for two AI-powered developer tools at WorkOS, focusing on the challenge of determining whether the tools genuinely enhance developer experiences.

The first tool, the WorkOS CLI, uses the Claude Agent SDK to automatically install WorkOS AuthKit across various frameworks, but its outputs vary widely, complicating testing. To address this, the author created an evaluation system using fixture projects to establish baseline states for comparison post-installation, grading the outputs on both functional and quality metrics.

The second tool generates structured context documents for WorkOS features to enhance AI-driven developer assistance. Here, the evaluation involves A/B testing to see whether the context improves LLM outputs, revealing that some contexts inadvertently degrade performance.

Both systems emphasize statistical measurement rather than deterministic testing, with a focus on tracking trends, understanding nuances through saved transcripts, and calibrating evaluations against real-world scenarios, ultimately leading to a more data-driven approach to shipping AI tools.
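The fixture-based grading described above can be sketched roughly as follows. All names here (`Fixture`, `GradedRun`, `grade_run`, the file paths) are illustrative assumptions, not WorkOS's actual harness: the idea is simply that a functional grade is pass/fail against the fixture's expected state, while the quality grade averages softer checks.

```python
from dataclasses import dataclass

@dataclass
class Fixture:
    name: str                 # e.g. a hypothetical "nextjs-app-router" fixture
    required_files: set[str]  # files the install must create or modify

@dataclass
class GradedRun:
    fixture: str
    functional: bool   # did the install produce the required baseline diff?
    quality: float     # 0..1 fraction of softer checks passed (style, idioms)

def grade_run(fixture: Fixture, produced_files: set[str],
              quality_checks: list[bool]) -> GradedRun:
    """Grade one agent run: functional is pass/fail, quality is an average."""
    functional = fixture.required_files <= produced_files
    quality = sum(quality_checks) / len(quality_checks) if quality_checks else 0.0
    return GradedRun(fixture.name, functional, quality)

# Illustrative run: the agent created the callback route but skipped middleware.
fx = Fixture("nextjs-app-router", {"app/callback/route.ts", "middleware.ts"})
run = grade_run(fx, {"app/callback/route.ts"}, [True, True, False])
print(run.functional, round(run.quality, 3))  # False 0.667
```

Separating the hard functional gate from the averaged quality score lets a run fail installation outright while still recording how close its output came, which suits trend tracking better than a single pass/fail bit.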
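The A/B comparison for context documents might look like the sketch below, under the assumption that each prompt is run once with and once without the context and graded pass/fail. The function names and the canned results are hypothetical (a real harness would call an LLM); the point is that the measured delta can come out negative, flagging a context that degrades performance.

```python
from statistics import mean

def pass_rate(results: list[bool]) -> float:
    """Fraction of graded runs that passed."""
    return mean(results)

def ab_delta(baseline: list[bool], with_context: list[bool]) -> float:
    """Positive delta means the context helped; negative means it hurt."""
    return pass_rate(with_context) - pass_rate(baseline)

# Canned example: on these prompts the context actually hurts (6/8 -> 4/8 pass).
baseline     = [True, True, False, True, True, False, True, True]
with_context = [True, False, False, True, False, False, True, True]
print(f"delta: {ab_delta(baseline, with_context):+.3f}")  # delta: -0.250
```

Because individual runs are noisy, a delta like this is only meaningful across many prompts and repeated runs, which matches the post's emphasis on statistical measurement and trend tracking over deterministic tests.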