Service Outage Postmortem: April 28

Blog post from Semaphore

Post Details
Company: Semaphore
Author: Aleksandar Mitrovic
Word Count: 615
Language: English
Summary

On April 28, 2023, Semaphore experienced a service disruption lasting over three hours (3 hours 20 minutes), during which job processing was delayed. Slow database queries caused a CPU usage spike in the production database. The incident began at 22:23 UTC and was resolved by 01:43 UTC the following day, after the on-call SRE team ran a manual vacuum on the jobs database table. The root cause was identified as an accumulation of dead tuples in a table with heavy read and write activity, compounded by autovacuum failing to run properly under increased traffic. In response, Semaphore plans to enhance database performance monitoring, conduct regular performance testing, and improve incident communication protocols. The company also acknowledged that it breached its own incident response policy by failing to update its public status page, and it committed to revising internal procedures and training to ensure better communication during future incidents.
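
The dead-tuple monitoring mentioned in the summary could look roughly like the sketch below. It assumes the database is PostgreSQL (the summary does not name the engine, though "dead tuples" and "vacuum" are PostgreSQL terms) and uses PostgreSQL's documented default autovacuum settings, where a vacuum is triggered once dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * tuples` (defaults 50 and 0.2). The `TableStats` type, the `alert_ratio` parameter, and the sample numbers are hypothetical and are not taken from Semaphore's postmortem.

```python
from dataclasses import dataclass

# Query one might run to collect the inputs (assumes PostgreSQL; the
# pg_stat_user_tables view exposes these columns). It is not executed here.
STATS_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
"""


@dataclass
class TableStats:
    """Hypothetical container for one row of the query above."""
    name: str
    live_tuples: int
    dead_tuples: int


def autovacuum_trigger_point(live_tuples, base_threshold=50, scale_factor=0.2):
    """Dead-tuple count at which autovacuum should start, using PostgreSQL's
    default settings. PostgreSQL itself uses pg_class.reltuples; the live
    tuple estimate is used here as an approximation."""
    return base_threshold + scale_factor * live_tuples


def tables_needing_attention(stats, alert_ratio=2.0):
    """Return names of tables whose dead tuples are far past the trigger
    point, which suggests autovacuum is not keeping up.
    alert_ratio is an illustrative choice, not a PostgreSQL setting."""
    flagged = []
    for table in stats:
        trigger = autovacuum_trigger_point(table.live_tuples)
        if table.dead_tuples > alert_ratio * trigger:
            flagged.append(table.name)
    return flagged


# Made-up sample numbers, for illustration only.
sample = [
    TableStats("jobs", live_tuples=1_000_000, dead_tuples=900_000),
    TableStats("users", live_tuples=50_000, dead_tuples=1_200),
]
print(tables_needing_attention(sample))
```

A check like this would typically feed an alert. The manual remedy described in the summary would correspond, in PostgreSQL, to running `VACUUM` on the affected table.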