Home / Companies / PagerDuty / Blog / Post Details
Content Deep Dive

Cassandra Load Disparities with WAN-based Quorum Reads

Blog post from PagerDuty

Post Details
Company
Date Published
Author
Evan Gilman
Word Count
1,207
Language
English
Hacker News Points
-
Summary

PagerDuty encountered unexpected performance issues with their Apache Cassandra deployments across geographically distributed datacenters, specifically noticing read hotspots despite uniform data distribution and access patterns. Initially attributing the uneven load to hardware discrepancies, further investigation revealed that the issue stemmed from the system's tendency to select nodes with the fastest response times for read requests. This led to certain nodes, particularly those in the us-west-1 datacenter, being disproportionately burdened due to their proximity and response efficiency. The challenge was compounded by the inherent round-trip time (RTT) variances in their network topology, resulting in a skewed distribution where nodes A-C received 60% more reads compared to others. Despite considering potential solutions, such as redistributing reads or adjusting network distances, the most viable approach was to scale up the overburdened nodes to manage the load effectively. This experience highlighted the complexities of assumptions in system design and the universal nature of such phenomena in distributed systems that favor low-latency operations.