Cassandra Load Disparities with WAN-based Quorum Reads
Blog post from PagerDuty
PagerDuty encountered unexpected performance issues with their Apache Cassandra deployments across geographically distributed datacenters, specifically noticing read hotspots despite uniform data distribution and access patterns. Initially attributing the uneven load to hardware discrepancies, further investigation revealed that the issue stemmed from the system's tendency to select nodes with the fastest response times for read requests. This led to certain nodes, particularly those in the us-west-1 datacenter, being disproportionately burdened due to their proximity and response efficiency. The challenge was compounded by the inherent round-trip time (RTT) variances in their network topology, resulting in a skewed distribution where nodes A-C received 60% more reads compared to others. Despite considering potential solutions, such as redistributing reads or adjusting network distances, the most viable approach was to scale up the overburdened nodes to manage the load effectively. This experience highlighted the complexities of assumptions in system design and the universal nature of such phenomena in distributed systems that favor low-latency operations.