Incident 7/29/2024: Evaluator outage

Post Details

Company

Gentrace

Date Published

July 30, 2024

Author

Doug Safreno

Word Count

676

Language

English

Hacker News Points

-

Source URL

gentrace.ai/blog/incident-07-29-2024-evaluator-outage

Summary

Gentrace experienced an evaluator outage from July 26th to July 29th, affecting all customers who ran tests. The issue was discovered when a customer reported it, as it was not automatically detected due to limitations in their alert system. The outage was caused by increased data submissions overwhelming their databases and a new architecture that inadvertently introduced a bug, leading to evaluation failures. Manual testing did not catch the bug because AI evaluations returned expected results despite issues, and automated tests assumed tasks would fetch and process data correctly. Monitoring failed to alert the team because only task failures, not evaluation failures, were flagged. To prevent future issues, Gentrace plans to implement comprehensive integration tests, anomaly detection for evaluation failures, and enhanced manual testing with varied seed data.