Writing Test Evals For Our MCP Server

Post Details

Company

Neon

Date Published

June 18, 2025

Author

David Gomes

Word Count

1,546

Language

English

Hacker News Points

-

Source URL

neon.com/blog/test-evals-for-mcp

Summary

Neon's MCP server exposes more than 20 tools, so its testing process has to confirm that large language models (LLMs) select the appropriate tool for each task and use it correctly. Two of these tools are specific to database migrations. "prepare_database_migration" creates a temporary branch and applies a SQL migration there, and "complete_database_migration" then applies the same change to the main branch. Because LLMs often struggle to follow this two-step workflow in the correct order, the team wrote evaluation tests, or "evals," that use a method called "LLM-as-a-judge" to check that models call the tools in the right sequence. The evals include the "factualityAnthropic" and "mainBranchIntegrityCheck" scorers, which help verify the factual accuracy and the integrity of the database migrations.

By refining only the tool descriptions, without changing any code, the team raised the eval pass rate from 60% to 100%. The evals run on the Braintrust platform, whose interface makes it easy to monitor and analyze test runs. The post closes by stressing the importance of rigorous testing and the benefits of managed services in software development.
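The two-step migration workflow, and the kind of sequencing check the evals perform, can be sketched in TypeScript. This is a minimal, hypothetical in-memory model, not Neon's implementation. Its assumptions are that branches are modeled as lists of applied SQL statements, that migration IDs are generated locally, and that "run_sql_on_main" is an invented tool name standing in for a model that bypasses the temporary branch. The scorer is also a deterministic stand-in, not the LLM-as-a-judge scorers ("factualityAnthropic", "mainBranchIntegrityCheck") the post describes.

```typescript
// Hypothetical sketch of the prepare/complete migration workflow and a
// deterministic sequencing scorer. Not Neon's code: branch model, IDs, and
// the "run_sql_on_main" tool are assumptions made for illustration.

interface Branch {
  name: string;
  appliedSql: string[]; // assumption: a branch is just its applied statements
}

// Hypothetical shape of one tool call in an eval transcript.
type ToolCall =
  | { tool: "prepare_database_migration"; sql: string }
  | { tool: "complete_database_migration"; migrationId: string }
  | { tool: "run_sql_on_main"; sql: string }; // invented: bypasses the workflow

class MigrationSimulator {
  readonly main: Branch = { name: "main", appliedSql: [] };
  private readonly pending = new Map<string, { branch: Branch; sql: string }>();
  private counter = 0;

  // Step 1: copy main into a temporary branch and apply the SQL there only.
  prepare(sql: string): string {
    const id = `migration-${++this.counter}`;
    const branch: Branch = {
      name: `tmp-${id}`,
      appliedSql: [...this.main.appliedSql, sql],
    };
    this.pending.set(id, { branch, sql });
    return id;
  }

  isPending(migrationId: string): boolean {
    return this.pending.has(migrationId);
  }

  // Step 2: apply the prepared SQL to main and discard the temporary branch.
  complete(migrationId: string): void {
    const entry = this.pending.get(migrationId);
    if (!entry) throw new Error(`No prepared migration with id ${migrationId}`);
    this.main.appliedSql.push(entry.sql);
    this.pending.delete(migrationId);
  }
}

// Replays a transcript and returns 1 if the tools were used in a valid order,
// 0 otherwise (completing an unprepared migration, or writing to main directly).
function scoreTranscript(calls: ToolCall[]): number {
  const sim = new MigrationSimulator();
  for (const call of calls) {
    switch (call.tool) {
      case "prepare_database_migration":
        sim.prepare(call.sql);
        break;
      case "complete_database_migration":
        if (!sim.isPending(call.migrationId)) return 0; // completed before preparing
        sim.complete(call.migrationId);
        break;
      case "run_sql_on_main":
        return 0; // skipped the temporary branch entirely
    }
  }
  return 1;
}

const good: ToolCall[] = [
  { tool: "prepare_database_migration", sql: "ALTER TABLE users ADD COLUMN email text" },
  { tool: "complete_database_migration", migrationId: "migration-1" },
];
const bad: ToolCall[] = [
  { tool: "complete_database_migration", migrationId: "migration-1" },
];

console.log(scoreTranscript(good), scoreTranscript(bad)); // prints "1 0"
```

A rule-based check like this only covers call ordering. Judging whether a model's final answer is accurate is the part the post delegates to an LLM judge.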