
We Had a Partial Outage in the US region

Blog post from Svix

Post Details
Author: Tom Hacohen
Word Count: 1,213
Language: English
Summary

We suffered a partial outage in the US region on Saturday, the 11th of March. Out-of-memory (OOM) errors in our containers caused most API calls in that region to return 5xx errors. The issue was not caused by a code change or unusual traffic, but by an unexpected spike in memory usage that led to containers being killed repeatedly. We suspect an edge case in a library we use for serialization, and we have since reproduced the issue locally and mitigated it by switching to a different allocator. The incident highlights the importance of monitoring and understanding AWS metrics, as well as the need for additional observability tooling. To prevent similar outages, our team is increasing memory capacity and using a more fragmentation-resistant allocator.
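As an illustration of the kind of extra observability mentioned above, here is a minimal sketch of an in-container memory check. It is not taken from the post. It assumes a Linux container using cgroup v2, where the kernel exposes `memory.current` and `memory.max`. The function names and the 85% threshold are hypothetical choices for the example.

```python
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2 layout. The post does not describe how the
# affected containers are configured, so this path is illustrative.
CGROUP_DIR = Path("/sys/fs/cgroup")


def read_memory_fraction(cgroup_dir: Path = CGROUP_DIR) -> Optional[float]:
    """Return memory usage as a fraction of the container limit.

    Returns None when the files are missing or unreadable, or when no
    limit is set (cgroup v2 writes the literal string "max").
    """
    try:
        current = int((cgroup_dir / "memory.current").read_text().strip())
        raw_limit = (cgroup_dir / "memory.max").read_text().strip()
        if raw_limit == "max":
            return None
        limit = int(raw_limit)
    except (OSError, ValueError):
        return None
    if limit <= 0:
        return None
    return current / limit


def should_alert(fraction: Optional[float], threshold: float = 0.85) -> bool:
    """Hypothetical policy: alert once usage reaches the threshold."""
    return fraction is not None and fraction >= threshold


if __name__ == "__main__":
    fraction = read_memory_fraction()
    if fraction is None:
        print("no memory limit found")
    else:
        print(f"memory usage: {fraction:.1%}, alert={should_alert(fraction)}")
```

A check like this only warns that a container is approaching its limit. It would not explain why memory grew, so it complements allocator-level investigation and does not replace it.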