Company
Date Published
Author
Thea Heinen
Word count
3135
Language
English
Hacker News points
None

Summary

Cloudflare's team found a rare bug in Go's arm64 compiler: a race condition in the generated code that surfaced only because of the company's scale of 84 million HTTP requests per second. The investigation began with sporadic panics on their arm64 machines and deepened when the panics became more frequent without a clear cause. The panics were linked to a Go Netlink library used in their systems, and the investigation centered on unsafe memory access and async preemption. The root cause turned out to be a bug in Go itself: when async preemption interrupted a goroutine partway through a stack pointer adjustment, the stack pointer was left invalid, and the program crashed when garbage collection or stack unwinding later walked that stack. After isolating the issue and reproducing it in a controlled environment, the team reported the bug, which has since been fixed in newer Go versions. The experience shows how hard rare race conditions are to debug at this scale, and how much resolving them depends on understanding low-level runtime mechanics.