The Discovery of Apache ZooKeeper’s Poison Packet
Blog post from PagerDuty
ZooKeeper, a widely used open-source project for distributed coordination, suffered serious reliability problems at PagerDuty because of a confluence of bugs in both ZooKeeper and the Linux kernel. The result was random cluster-wide lockups, traced to four bugs: two in ZooKeeper, involving client session overloads and unhandled exceptions in critical threads, and two in the kernel, involving TCP payload corruption and checksum validation failures under specific conditions. The investigation linked the TCP payload corruption to the use of IPSec in Transport Mode combined with certain versions of the Linux kernel and Xen virtualization, a combination that allowed corrupted packets to bypass validation. The aesni-intel kernel module was further implicated in the corruption during AES encryption. Despite arduous troubleshooting, including downgrading affected systems and blacklisting problematic modules, a definitive fix remains elusive, although workarounds have been put in place to mitigate the issues. The investigation underscores the complex interplay between software components and the difficulty of maintaining high reliability in distributed systems.
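To make the checksum point concrete, here is a minimal Python sketch of the 16-bit one's-complement checksum described in RFC 1071, the scheme TCP uses. It is a simplification for illustration only: a real TCP checksum also covers a pseudo-header, and this is not the kernel's code. The function names `internet_checksum` and `receive` and the `validate` flag are hypothetical, introduced here to show what happens when a receiver skips validation.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def receive(segment: bytes, checksum: int, validate: bool = True) -> bytes:
    """Hypothetical receiver: drops a segment whose checksum does not match.

    With validate=False (an assumption standing in for the bypassed check),
    corrupted bytes are handed to the application unnoticed.
    """
    if validate and internet_checksum(segment) != checksum:
        raise ValueError("checksum mismatch: segment dropped")
    return segment


if __name__ == "__main__":
    payload = b"zookeeper session data"
    checksum = internet_checksum(payload)

    # Flip one bit to simulate corruption in transit.
    corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

    try:
        receive(corrupted, checksum)
    except ValueError as err:
        print("validated path:", err)

    # Without validation the corrupted payload is delivered as if it were good.
    print("unvalidated path delivered:", receive(corrupted, checksum, validate=False))
```

With validation on, the single flipped bit is caught and the segment is rejected. With validation skipped, the application receives bad data with no signal that anything went wrong, which is why a bypassed check turns a transient corruption into an application-level failure.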