Company
Date Published
Author
Shlomi Noach
Word count
3350
Language
English
Hacker News points
None

Summary

GitHub employs a sophisticated high-availability solution for its MySQL databases, critical to its operations, by utilizing orchestrator for failure detection, Hashicorp’s Consul for service discovery, and GLB/HAProxy for load balancing. This setup replaces the previous reliance on VIP and DNS-based discovery, which had limitations such as potential split-brain scenarios and slower failover processes. The new architecture leverages anycast IP, ensuring uniform IP resolution across different data centers while routing traffic based on client location. Orchestrator nodes use a raft consensus for coordinated failovers, promoting a new primary server when failures are detected, while Consul updates ensure all GLB/HAProxy nodes are aware of the change, minimizing outage times. This system is designed to be data center agnostic, tolerant of isolation issues, and capable of achieving typically lossless failovers, reducing outage times to between 10 and 13 seconds in most cases. Despite its robustness, the setup acknowledges some limitations, such as potential split-brain scenarios during data center isolation, which GitHub is working to mitigate with further improvements like implementing STONITH mechanisms.