Exa has developed a rigorous evaluation methodology for its AI-powered search engine to verify that it outperforms other search APIs. Having built the engine from the ground up, with a distributed crawling system, custom embedding models, and a new vector database, Exa needs evaluations that measure whether those components actually improve result quality for AI applications.

The evaluation process combines two kinds of grading: pure result grading, where LLM graders score the relevance and quality of the returned results themselves, and RAG grading, which measures how much those results improve an LLM's question-answering accuracy on downstream tasks.

Exa's approach emphasizes "open evaluations," which allow flexibility in the choice of query sets and rely on LLMs for grading. This addresses the limitations of traditional "closed evals" such as MS MARCO, which are constrained by the scale of human labeling and suffer from false negatives: relevant documents that were never labeled get scored as irrelevant.

Grades are aggregated using pointwise, pairwise, and listwise methods, trading off theoretical soundness against practicality. The grading prompts are carefully calibrated so that scores stay consistent across runs and correlate with human preferences, with models such as GPT-4.1 serving as graders. Overall, the philosophy is to optimize real-world search performance, enable rapid iteration, and stay relevant to current topics, yielding a comprehensive measure of both search result quality and downstream task performance.
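
To make pointwise result grading concrete, here is a minimal sketch using the OpenAI Python SDK. The rubric prompt is illustrative, not Exa's actual calibrated prompt, and the example assumes the grader model returns a bare integer as instructed.

```python
# Minimal pointwise grader sketch. Assumes OPENAI_API_KEY is set in the
# environment; the prompt below is an illustrative rubric, not Exa's own.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading a search result for relevance and quality.
Query: {query}
Result title: {title}
Result text: {text}

Score the result from 0 (irrelevant) to 10 (perfectly relevant, high quality).
Reply with the integer score only."""


def grade_result(query: str, title: str, text: str, model: str = "gpt-4.1") -> int:
    """Ask the LLM grader for a 0-10 pointwise relevance score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            query=query, title=title, text=text)}],
        temperature=0,  # deterministic grading improves run-to-run consistency
    )
    # Assumes the model complies with the "integer only" instruction.
    return int(response.choices[0].message.content.strip())
```

A per-query score for a result list can then be computed by averaging `grade_result` over the top-k results.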
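RAG grading can be sketched the same way. The example below assumes a hypothetical `search_fn(query) -> list[str]` stand-in for whatever search API is under test, and uses a second LLM call as a judge comparing the generated answer against a reference answer; both prompts are illustrative.

```python
# RAG grading sketch: retrieve, answer from retrieved context only, then judge
# the answer against a reference. `search_fn` is a hypothetical stand-in for
# the search API being evaluated.
from openai import OpenAI

client = OpenAI()

ANSWER_PROMPT = """Answer the question using only the search results below.
Question: {question}

Search results:
{context}

Answer concisely."""

JUDGE_PROMPT = """Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Does the candidate answer convey the same facts as the reference? Reply "yes" or "no"."""


def rag_grade(question, reference, search_fn, model="gpt-4.1"):
    """Return 1 if the search results let the LLM answer correctly, else 0."""
    context = "\n\n".join(search_fn(question))
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ANSWER_PROMPT.format(
            question=question, context=context)}],
        temperature=0,
    ).choices[0].message.content
    verdict = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=answer)}],
        temperature=0,
    ).choices[0].message.content.strip().lower()
    return 1 if verdict.startswith("yes") else 0
```

Averaging `rag_grade` over a question set gives a downstream-accuracy number per search engine.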
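Pairwise grading shows the judge two engines' result lists for the same query and asks which is better. A known pitfall is position bias, so a reasonable sketch randomizes which engine appears first; again, the prompt is illustrative rather than Exa's.

```python
# Pairwise comparison sketch with order randomization to reduce position bias.
import random

from openai import OpenAI

client = OpenAI()

PAIRWISE_PROMPT = """Two search engines answered the same query.
Query: {query}

Results A:
{results_a}

Results B:
{results_b}

Which result list is more relevant and useful overall? Reply with exactly "A" or "B"."""


def pairwise_winner(query, results_a, results_b, model="gpt-4.1"):
    """Return the winning engine label ("A" or "B"), randomizing display order."""
    engines = [("A", results_a), ("B", results_b)]
    random.shuffle(engines)  # which engine fills the "Results A" slot is random
    (label_1, list_1), (label_2, list_2) = engines
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(
            query=query,
            results_a="\n".join(list_1),
            results_b="\n".join(list_2))}],
        temperature=0,
    )
    pick = response.choices[0].message.content.strip()
    # Map the slot the grader picked back to the underlying engine label.
    return label_1 if pick == "A" else label_2
```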
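Finally, a toy aggregation pass, assuming per-query pointwise scores and pairwise winners produced by graders like the ones above (the names and shapes here are illustrative):

```python
# Toy aggregation: collapse per-query grades into engine-level metrics.
from statistics import mean


def aggregate(pointwise_scores, pairwise_winners):
    """Summarize per-query grades for one engine versus a baseline."""
    return {
        "mean_pointwise": mean(pointwise_scores),  # average 0-10 score
        "pairwise_win_rate": pairwise_winners.count("A") / len(pairwise_winners),
    }


print(aggregate([7, 9, 6, 8], ["A", "B", "A", "A"]))
# {'mean_pointwise': 7.5, 'pairwise_win_rate': 0.75}
```

Listwise grading follows the same pattern, except the grader scores an entire ranked list in one call rather than individual results or pairs.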