In exploring whether the same large language model (LLM) should handle both reasoning and evaluation in AI agents, an experiment was conducted using a movie recommendation agent built on four core models: OpenAI's GPT-4.1, Anthropic's Claude 3.7, Google's Gemini 2.5, and the open-source Qwen3-235B. The study aimed to uncover self-evaluation bias, where a model scores its own outputs more favorably than those of other models. Initial tests showed that all four models exhibited some self-evaluation bias when judging internally, but only Google's Gemini continued to show clear bias after scores were calibrated against human ratings. Anthropic's Claude stood out for its consistency and alignment with human judgments, remaining stable across evaluators. Limitations included a small dataset and a focus on orchestration quality, which may not capture other relevant performance dimensions. The findings, while not definitive, offer practical guidance for those building and testing agentic AI systems.
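The cross-evaluation and calibration step described above can be sketched as a score matrix: each model judges every model's outputs, raw self-bias is a model's self-score minus the mean score the other judges assign it, and calibrated bias subtracts a human baseline instead. All model names map to the study's models, but every number below is an illustrative placeholder, not the study's actual data.

```python
# Hypothetical sketch of measuring self-evaluation bias from a cross-evaluation
# score matrix. scores[judge][author] is the mean score that `judge` assigns to
# outputs produced by `author`; human_scores[author] is a human baseline.
# All numeric values are invented for illustration only.

MODELS = ["gpt-4.1", "claude-3.7", "gemini-2.5", "qwen3-235b"]

scores = {
    "gpt-4.1":    {"gpt-4.1": 8.6, "claude-3.7": 8.1, "gemini-2.5": 7.9, "qwen3-235b": 7.7},
    "claude-3.7": {"gpt-4.1": 8.0, "claude-3.7": 8.2, "gemini-2.5": 7.8, "qwen3-235b": 7.6},
    "gemini-2.5": {"gpt-4.1": 7.7, "claude-3.7": 7.8, "gemini-2.5": 8.9, "qwen3-235b": 7.5},
    "qwen3-235b": {"gpt-4.1": 7.9, "claude-3.7": 8.0, "gemini-2.5": 7.8, "qwen3-235b": 8.3},
}

human_scores = {"gpt-4.1": 8.4, "claude-3.7": 8.1, "gemini-2.5": 7.8, "qwen3-235b": 8.2}

def raw_self_bias(model: str) -> float:
    """Self-score minus the mean score the other judges give the same outputs."""
    others = [scores[judge][model] for judge in MODELS if judge != model]
    return scores[model][model] - sum(others) / len(others)

def calibrated_self_bias(model: str) -> float:
    """Self-score minus the human score for the same outputs."""
    return scores[model][model] - human_scores[model]

for m in MODELS:
    print(f"{m}: raw={raw_self_bias(m):+.2f}, vs-human={calibrated_self_bias(m):+.2f}")
```

With these placeholder numbers, every model shows a positive raw self-bias, but only the Gemini row keeps a large gap after calibration against the human baseline, mirroring the pattern the study reports.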