Incident postmortem: January 2026 service disruptions

Post Details

Company

Redocly

Date Published

Jan. 27, 2026

Author

Roman Hotsiy

Word Count

1,043

Source URL

redocly.com/blog/jan-2026-outage-postmortem

Summary

Over a two-week period in January 2026, Redocly experienced three major service disruptions caused by infrastructure instability and architectural bottlenecks. The disruptions affected the Redocly Reunite management panel and authenticated customer projects. The incidents on January 13 and January 26 stemmed from orchestration-layer instability during routine maintenance, which led to leader election failures and memory exhaustion. The January 14 disruption was a cascading failure in a background job queue that overloaded the database and the secrets engine. Immediate corrective actions included infrastructure upgrades with increased server capacity, enhanced monitoring, and operational changes such as scheduling maintenance during off-hours. To prevent recurrence, Redocly is refactoring its queue logic, implementing circuit breakers, and decoupling authentication from the main API so that it is no longer a single point of failure. The Redocly team says it is committed to improving reliability and thanked users for their patience while these improvements are made.
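The summary names circuit breakers as one of the planned safeguards but does not describe how Redocly implements them. As an illustration only, here is a minimal sketch of the general pattern in Python. The class name `CircuitBreaker`, its parameters (`failure_threshold`, `reset_timeout`, `clock`), and the `CircuitOpenError` exception are hypothetical names chosen for this example, not anything taken from the postmortem.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    """Hypothetical sketch: stop calling a failing dependency for a while."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: the next call is a trial
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            # Fail fast instead of adding load to a struggling dependency.
            raise CircuitOpenError("dependency unavailable; call skipped")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In a setup like the one described, a background worker might wrap each database or secrets-engine call in `breaker.call(...)`, so that repeated failures pause further calls rather than piling more load onto an already overloaded dependency.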