How We Formally Verified Our Merge Queue with TLA+ (and Found Bugs That Tests Missed)
Blog post from Mergify
Julien Danjou discusses the use of TLA+ to formally verify a merge queue state machine, uncovering bugs missed by traditional testing and years of production. By modeling the queue as a train where each car represents one or more pull requests (PRs), the TLA+ model checker, TLC, explored 468,000 states, identifying bugs related to the interaction of features like freezing, batching, and speculative CI. The process revealed two significant bugs: one where frozen PRs erroneously proceeded to merge, and another where speculative skips promoted untested code, both of which were difficult to catch with conventional testing. The exercise emphasized the importance of thoughtful abstraction in the model and highlighted the maintenance challenge of keeping the TLA+ specification aligned with the Python codebase. Danjou concludes that while the TLA+ model is an effective tool for identifying subtle bugs, it requires careful consideration of invariants and abstraction boundaries to ensure meaningful verification.