Company
Date Published
Author
Doug Safreno
Word count
676
Language
English
Hacker News points
None

Summary

Gentrace experienced an evaluator outage from July 26th to July 29th, affecting all customers who ran tests. The issue was discovered when a customer reported it, as it was not automatically detected due to limitations in their alert system. The outage was caused by increased data submissions overwhelming their databases and a new architecture that inadvertently introduced a bug, leading to evaluation failures. Manual testing did not catch the bug because AI evaluations returned expected results despite issues, and automated tests assumed tasks would fetch and process data correctly. Monitoring failed to alert the team because only task failures, not evaluation failures, were flagged. To prevent future issues, Gentrace plans to implement comprehensive integration tests, anomaly detection for evaluation failures, and enhanced manual testing with varied seed data.