Roleplays: How we wrote evals for the Hume MCP Server

Post Details

Company

Hume

Date Published

June 10, 2025

Author

Richard Marmorstein

Word Count

1,512

Language

English

Hacker News Points

1

Source URL

www.hume.ai/blog/roleplays-evals-hume-mcp-server

Summary

The Hume AI MCP Server lets users interact with the Hume AI assistant directly from client applications that implement the Model Context Protocol (MCP), enabling tasks such as narrating audiobooks or designing voice characters for video games. Development of the server revealed problems in delivering a quality experience, including failures to use the right voice, format text correctly, and continue audio from previous generations. To address this, the team put on "prompt engineer hats" and began evaluating the prompt descriptions and field descriptions, and realized that a production-grade experience required more comprehensive evaluations. Capturing long interactions proved challenging, because the existing literature focuses primarily on traditional question/answer prompts. The team explored four approaches to evaluating the MCP server: single-response evals, postponement-tolerant evals, manually-extended evals, and role-play-based evals. They chose the last, which offloads task extension to LLMs, as the most direct method for evaluating the server's performance. It also makes evaluation more straightforward, because entire transcripts are analyzed rather than broken down into individual stages. The approach has weaknesses, such as tooling limitations and high cost, but it provides a practical indicator of whether changes to the prompt improve assistant behavior. The MCP server and its evals are open source, making this approach accessible to others.
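
The role-play-based approach in the summary can be illustrated with a minimal sketch. It is not taken from the Hume repository: every name here (`run_roleplay`, `scripted_roleplayer`, `stub_assistant`, `stub_judge`, the `[DONE]` stop marker) is hypothetical, and the LLM calls are replaced by plain Python stubs so the loop runs offline. In a real eval, the role-player, the assistant under test, and the judge would presumably each be backed by a model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    speaker: str  # "user" (played by the role-player) or "assistant"
    text: str


Transcript = List[Turn]
Agent = Callable[[Transcript], str]  # hypothetical: transcript so far -> next message


def run_roleplay(roleplayer: Agent, assistant: Agent,
                 max_turns: int = 6, stop_marker: str = "[DONE]") -> Transcript:
    """Alternate role-player and assistant turns until the role-player stops."""
    transcript: Transcript = []
    for _ in range(max_turns):
        user_msg = roleplayer(transcript)
        if stop_marker in user_msg:  # role-player decides the task is finished
            break
        transcript.append(Turn("user", user_msg))
        transcript.append(Turn("assistant", assistant(transcript)))
    return transcript


# --- Stubs standing in for LLM calls (assumptions, for illustration only) ---

def scripted_roleplayer(transcript: Transcript) -> str:
    """Plays a user narrating an audiobook; a real one would be an LLM."""
    script = [
        "Please narrate chapter one in a calm voice.",
        "Now continue with chapter two in the same voice.",
        "[DONE]",
    ]
    user_turns = sum(1 for t in transcript if t.speaker == "user")
    return script[min(user_turns, len(script) - 1)]


def stub_assistant(transcript: Transcript) -> str:
    """Pretends to call a text-to-speech tool for the latest user request."""
    return f"[tool call] narrate(voice='calm', text={transcript[-1].text!r})"


def stub_judge(transcript: Transcript) -> Dict[str, object]:
    """Scores the whole transcript at once instead of stage by stage."""
    assistant_turns = [t for t in transcript if t.speaker == "assistant"]
    consistent = all("voice='calm'" in t.text for t in assistant_turns)
    return {"voice_consistent": consistent, "turns": len(transcript)}


if __name__ == "__main__":
    result = stub_judge(run_roleplay(scripted_roleplayer, stub_assistant))
    print(result)
```

The judge receives the full transcript, which mirrors the summary's point that scoring a whole conversation is simpler than scoring each stage separately. Swapping the stubs for model-backed callables is where the cost noted in the summary would come from, since each eval run requires several model calls per turn.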