Labelbox conducted a comprehensive study evaluating three advanced language models with native web search capabilities: Google Gemini 2.5 Pro, OpenAI GPT-4.1, and Anthropic Claude 4.0 Opus. The study assessed how accurately, currently, and diversely each model responded to 200 complex and varied queries spanning multiple domains, including STEM, current events, historical information, and multi-language contexts. The evaluation focused on four key dimensions: source quality, answer relevance, information recency, and multi-language understanding.

Results showed that Gemini excelled in recency and access to scientific sources, GPT-4.1 was strong in synthesis and reasoning, and Claude stood out for clarity of explanation. Citation reliability, however, emerged as a weakness common to all three models, underscoring the need for improved source-verification processes in enterprise applications. The study emphasizes the importance of matching question types and domain specializations to the model when deploying these systems for enterprise use.
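To make the evaluation setup concrete, here is a minimal Python sketch of how per-query ratings could be aggregated into per-model averages on the four dimensions. This is an illustration only: the model names, dimension labels, rating scale, and scores below are hypothetical placeholders, not data from the Labelbox study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-query ratings (1-5 scale) on the study's four dimensions.
# Every row here is an illustrative placeholder, not an actual study result.
ratings = [
    {"model": "Gemini 2.5 Pro", "dimension": "recency", "score": 5},
    {"model": "Gemini 2.5 Pro", "dimension": "source_quality", "score": 4},
    {"model": "GPT-4.1", "dimension": "answer_relevance", "score": 4},
    {"model": "GPT-4.1", "dimension": "recency", "score": 4},
    {"model": "Claude 4.0 Opus", "dimension": "answer_relevance", "score": 5},
]

def aggregate(rows):
    """Group scores by (model, dimension) and return the mean for each cell."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r["model"], r["dimension"])].append(r["score"])
    return {key: mean(vals) for key, vals in cells.items()}

summary = aggregate(ratings)
for (model, dim), avg in sorted(summary.items()):
    print(f"{model:16s} {dim:16s} {avg:.2f}")
```

With real study data, each (model, dimension) cell would average hundreds of rubric scores rather than the single placeholder values shown here.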