Company
Date Published
Author
Tom Hacohen
Word count
1213
Language
English
Hacker News points
None

Summary

We suffered a partial outage on Saturday, the 11th of March, caused by out-of-memory (OOM) errors in our containers, which resulted in most API calls in the US region returning 5xx errors. The issue was not triggered by code changes or unusual traffic, but by an unexpected spike in memory usage that led to the containers being repeatedly killed and restarted. We suspect an edge case in a library we use for serialization may be responsible; we have since reproduced the issue locally and mitigated it by switching to a different allocator. The incident highlights the importance of monitoring and understanding AWS metrics, as well as the need for additional observability tooling. Our team is taking steps to prevent similar outages in the future, including increasing memory capacity and adopting more fragmentation-resistant allocators.
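The summary does not name the language or the allocator involved, but a common way to mitigate allocator-driven memory growth in a Rust service is to swap the global allocator for a more fragmentation-resistant one such as jemalloc. The sketch below is an illustration of that technique under those assumptions (a Rust binary and the tikv-jemallocator crate), not the author's actual fix.

```rust
// Cargo.toml (hypothetical): pull in the crate that exposes jemalloc.
// [dependencies]
// tikv-jemallocator = "0.5"

use tikv_jemallocator::Jemalloc;

// Replace the default system allocator with jemalloc for the whole binary.
// jemalloc's size-class arenas tend to resist the kind of heap fragmentation
// that can make resident memory keep growing even when live data does not.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // All heap allocations from here on go through jemalloc.
    let payloads: Vec<String> = (0..1_000)
        .map(|i| format!("serialized payload {i}"))
        .collect();
    println!("allocated {} payloads", payloads.len());
}
```

A similar one-line swap works with other allocators (e.g. mimalloc); the relevant point from the postmortem is that the allocator, rather than the application code, was the lever used to stop the runaway memory usage.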