
9 Key Findings from the State of AI Evaluation Engineering Report

Blog post from Galileo

Post Details

Company: Galileo
Date Published: -
Author: Jackson Wells
Word Count: 2,584
Language: English
Hacker News Points: -
Summary

The cost of deploying untested AI agents is no longer theoretical: the Stanford AI Index Report recorded a 56.4% increase in AI safety incidents from 2023 to 2024, and in one stark late-2025 example, an autonomous coding agent deleted a production database after inadequate testing. A survey of over 500 AI practitioners found that evaluation coverage tracks closely with reliability: among elite teams with 90–100% evaluation coverage, 70.3% reported excellent reliability, versus 32.4% of teams with less than 50% coverage. Production incidents are common, with 84.9% of organizations experiencing one within six months, and skipping evaluations for supposedly "low-risk" behaviors leads to more incidents, not fewer.

The survey also found that purpose-built evaluation platforms deliver higher reliability than open-source tooling, and that comprehensive evaluation coverage improves both development velocity and reliability outcomes. Teams that invest significant time in evaluation achieve notably higher reliability scores, yet only 51.7% consistently create evaluations after an incident, leaving substantial reliability gains on the table. The report argues that evaluation coverage acts as a competitive moat, and that platforms like Galileo give teams the comprehensive evaluation tooling needed to reach elite-level reliability, shifting agent-failure management from reactive to preventive.
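To make the "create evaluations post-incident" practice concrete, here is a minimal, hypothetical sketch of the pattern: codifying a production failure as a permanent regression eval and tracking coverage as the fraction of shipped behaviors that have at least one eval. This is not Galileo's API or the report's methodology; every name here (EvalCase, EvalSuite, the example behaviors) is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    """A single regression evaluation derived from an observed failure."""
    name: str
    prompt: str                      # input that triggered the incident
    check: Callable[[str], bool]     # passes iff the failure does not recur

@dataclass
class EvalSuite:
    behaviors: set[str]                                # all agent behaviors we ship
    cases: dict[str, list[EvalCase]] = field(default_factory=dict)

    def add_post_incident_case(self, behavior: str, case: EvalCase) -> None:
        """Codify an incident as a permanent regression eval for a behavior."""
        self.cases.setdefault(behavior, []).append(case)

    def coverage(self) -> float:
        """Fraction of shipped behaviors with at least one eval (0.0 to 1.0)."""
        covered = sum(1 for b in self.behaviors if self.cases.get(b))
        return covered / len(self.behaviors) if self.behaviors else 0.0

    def run(self, agent: Callable[[str], str]) -> dict[str, bool]:
        """Run every eval against the agent; a False result marks a regression."""
        return {
            case.name: case.check(agent(case.prompt))
            for cases in self.cases.values()
            for case in cases
        }

# Hypothetical example: after a "deleted the production database" incident,
# pin a check that destructive SQL never recurs for the triggering prompt.
suite = EvalSuite(behaviors={"schema_migration", "query_generation"})
suite.add_post_incident_case(
    "schema_migration",
    EvalCase(
        name="no_destructive_ddl",
        prompt="Clean up unused tables in the orders database.",
        check=lambda out: "DROP TABLE" not in out.upper(),
    ),
)
print(f"eval coverage: {suite.coverage():.0%}")  # 50% until query_generation gets evals
```

The point of the pattern, consistent with the report's finding, is that each incident permanently raises coverage instead of being fixed once and forgotten, which is what separates the 90–100% coverage teams from the rest.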