In exploring whether the same large language model (LLM) should handle both reasoning and evaluation in AI agents, an experiment was conducted using a movie recommendation agent built on four core models: OpenAI's GPT-4.1, Anthropic's Claude 3.7, Google's Gemini 2.5, and the open-source Qwen3-235B. The study aimed to uncover self-evaluation bias, where a model scores its own outputs more favorably than those of other models. Initial tests showed that all four models exhibited some self-evaluation bias when judging internally, but only Google's Gemini continued to show clear bias after scores were calibrated against human ratings. Anthropic's Claude stood out for its consistency and alignment with human judgments, remaining stable across evaluators. Limitations included a small dataset and a focus on orchestration quality, which may not capture other relevant performance dimensions. The findings, while not definitive, offer practical guidance for those building and testing agentic AI systems.
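The cross-evaluation and calibration step described above can be sketched as a score matrix: each model judges every model's outputs, raw self-bias is a model's self-score minus the mean score the other judges assign it, and calibrated bias subtracts a human baseline instead. All model names map to the study's models, but every number below is an illustrative placeholder, not the study's actual data.

```python
# Hypothetical sketch of measuring self-evaluation bias from a cross-evaluation
# score matrix. scores[judge][author] is the mean score that `judge` assigns to
# outputs produced by `author`; human_scores[author] is a human baseline.
# All numeric values are invented for illustration only.

MODELS = ["gpt-4.1", "claude-3.7", "gemini-2.5", "qwen3-235b"]

scores = {
    "gpt-4.1":    {"gpt-4.1": 8.6, "claude-3.7": 8.1, "gemini-2.5": 7.9, "qwen3-235b": 7.7},
    "claude-3.7": {"gpt-4.1": 8.0, "claude-3.7": 8.2, "gemini-2.5": 7.8, "qwen3-235b": 7.6},
    "gemini-2.5": {"gpt-4.1": 7.7, "claude-3.7": 7.8, "gemini-2.5": 8.9, "qwen3-235b": 7.5},
    "qwen3-235b": {"gpt-4.1": 7.9, "claude-3.7": 8.0, "gemini-2.5": 7.8, "qwen3-235b": 8.3},
}

human_scores = {"gpt-4.1": 8.4, "claude-3.7": 8.1, "gemini-2.5": 7.8, "qwen3-235b": 8.2}

def raw_self_bias(model: str) -> float:
    """Self-score minus the mean score the other judges give the same outputs."""
    others = [scores[judge][model] for judge in MODELS if judge != model]
    return scores[model][model] - sum(others) / len(others)

def calibrated_self_bias(model: str) -> float:
    """Self-score minus the human score for the same outputs."""
    return scores[model][model] - human_scores[model]

for m in MODELS:
    print(f"{m}: raw={raw_self_bias(m):+.2f}, vs-human={calibrated_self_bias(m):+.2f}")
```

With these placeholder numbers, every model shows a positive raw self-bias, but only the Gemini row keeps a large gap after calibration against the human baseline, mirroring the pattern the study reports.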