How do frontier models perform on real-world finance problems?
Blog post from Surge AI
In a comprehensive evaluation of frontier language models on finance tasks, domain experts tested three models (GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro) across more than 200 scenarios, revealing both their potential and their limitations. GPT-5 emerged as the strongest performer, producing the best response in 47% of tasks and winning most head-to-head comparisons, yet all three models showed significant shortcomings: failing to account for real-world financial constraints, executing multi-step workflows poorly, and making errors in file handling and domain calibration.

Concrete tasks made these deficiencies visible. When asked to build a PowerPoint presentation on market crash scenarios or update financial forecasts in Excel, GPT-5 was the only model to produce a nearly complete deliverable, and even it omitted elements such as risk-mitigation commentary.

The evaluation underscored both the models' sophistication and their systematic gaps, highlighting the need for high-quality, real-world training data to close the distance between theoretical knowledge and practical financial expertise.