How do frontier models perform on real-world finance problems?
Blog post from Surge AI
In a comprehensive evaluation of frontier language models on finance tasks, domain experts tested three models (GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro) across more than 200 scenarios, revealing both their potential and their limitations. GPT-5 emerged as the strongest performer, producing the best response in 47% of tasks and winning most head-to-head comparisons, yet all three models showed significant shortcomings: failing to account for real-world financial constraints, executing multi-step workflows poorly, and making errors in file handling and domain calibration.

Concrete tasks made these deficiencies visible. When asked to build a PowerPoint presentation on market crash scenarios or update financial forecasts in Excel, GPT-5 was the only model to produce a nearly complete deliverable, and even it omitted elements such as risk-mitigation commentary.

The evaluation underscored both the models' sophistication and their systematic gaps, highlighting the need for high-quality, real-world training data to close the distance between theoretical knowledge and practical financial expertise.