Company
Date Published
Author
Tom Hacohen
Word count
1213
Language
English
Hacker News points
None

Summary

We suffered a partial outage on Saturday, the 11th of March, caused by out-of-memory (OOM) errors in our containers, which resulted in most API calls in the US region returning 5xx errors. The issue was not triggered by code changes or unusual traffic, but by an unexpected spike in memory usage that led to the containers being repeatedly killed and restarted. We suspect an edge case in a library we use for serialization may be responsible; we have since reproduced the issue locally and mitigated it by switching to a different allocator. The incident highlights the importance of monitoring and understanding AWS metrics, as well as the need for additional observability tooling. Our team is taking steps to prevent similar outages in the future, including increasing memory capacity and adopting more fragmentation-resistant allocators.
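The summary does not name the language or the allocator involved, but a common way to mitigate allocator-driven memory growth in a Rust service is to swap the global allocator for a more fragmentation-resistant one such as jemalloc. The sketch below is an illustration of that technique under those assumptions (a Rust binary and the tikv-jemallocator crate), not the author's actual fix.

```rust
// Cargo.toml (hypothetical): pull in the crate that exposes jemalloc.
// [dependencies]
// tikv-jemallocator = "0.5"

use tikv_jemallocator::Jemalloc;

// Replace the default system allocator with jemalloc for the whole binary.
// jemalloc's size-class arenas tend to resist the kind of heap fragmentation
// that can make resident memory keep growing even when live data does not.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // All heap allocations from here on go through jemalloc.
    let payloads: Vec<String> = (0..1_000)
        .map(|i| format!("serialized payload {i}"))
        .collect();
    println!("allocated {} payloads", payloads.len());
}
```

A similar one-line swap works with other allocators (e.g. mimalloc); the relevant point from the postmortem is that the allocator, rather than the application code, was the lever used to stop the runaway memory usage.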