
What we learned from a 22-day storage bug (and how we fixed it)

Blog post from Mux

Post Details
Company
Mux
Date Published
Author
Drew Rodman
Word Count
2,713
Language
English
Summary

Between January 8th and February 4th, Mux Video served approximately 0.33% of video and audio segments in a corrupted state, which caused playback problems such as audio dropouts and visual stuttering for some viewers. The corruption stemmed from a combination of three factors: context cancellation in remote reads, a race condition in file deletions, and operational slowdowns caused by a scaling change to the storage nodes. Mux has fixed the immediate causes by eliminating the race condition, resolving the context cancellation issue, and adjusting storage node counts to remove bottlenecks. The team also regenerated the affected segments and purged the corrupted copies from CDN caches so they would not be served again. Going forward, Mux has committed to improving system observability and its support escalation process, both to prevent similar incidents and to keep customers informed.