A merge queue is critical infrastructure. Build it accordingly.
Blog post from Mergify
Julien Danjou argues that a merge queue deserves the same rigor as any other critical infrastructure, using the April 23 incident at GitHub as a cautionary tale. For over four hours, GitHub's merge queue silently corrupted merges: the code that landed diverged from the code that was tested, no alerts fired, and the audit trail showed no discrepancies. The core issue is structural. A merge queue is only sound if the commit that was tested is exactly the commit that gets merged, and GitHub's replay-after-CI strategy violates that principle, which opens the door to silent corruption. Danjou adds that at platform scale every edge case in a merge queue can have its own failure path, so a merge queue has to be treated as a standalone product rather than as one more feature, with investment in formal verification to prevent such silent failures. For teams affected by the incident, he suggests a thorough audit of merge commits, comparing tree hashes to detect discrepancies, while noting that the process can be complex and that teams should keep comprehensive records to make such audits feasible.
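
The tree-hash audit mentioned above can be sketched roughly as follows. This is a minimal illustration, not the procedure from the article: it assumes you have your own record of which commit CI tested for each merge, and the names `MergeRecord`, `git_tree_hash`, and `find_divergent_merges` are hypothetical. The only Git feature relied on is `git rev-parse <commit>^{tree}`, which prints the hash of a commit's root tree. Two commits with the same tree hash contain identical file contents, even if their commit hashes differ.

```python
import subprocess
from typing import Callable, Iterable, List, NamedTuple


class MergeRecord(NamedTuple):
    """One merge to audit (hypothetical record shape, taken from your own logs)."""
    tested_sha: str  # commit that CI actually ran against
    merged_sha: str  # commit that landed on the target branch


def git_tree_hash(commit: str, repo: str = ".") -> str:
    """Return the root tree hash of `commit` via `git rev-parse <commit>^{tree}`."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-parse", f"{commit}^{{tree}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def find_divergent_merges(
    records: Iterable[MergeRecord],
    tree_of: Callable[[str], str] = git_tree_hash,
) -> List[MergeRecord]:
    """Return the records whose merged tree differs from the tested tree."""
    return [r for r in records if tree_of(r.tested_sha) != tree_of(r.merged_sha)]
```

Passing the tree lookup as a parameter keeps the comparison logic testable without a repository. In a real audit the hard part is likely to be reconstructing the `tested_sha` for each merge, which is why keeping comprehensive records matters.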