Cassandra Load Disparities with WAN-based Quorum Reads

Post Details

Company

PagerDuty

Date Published

Nov. 20, 2015

Author

Evan Gilman

Word Count

1,207

Language

English

Hacker News Points

-

Source URL

www.pagerduty.com/blog/uncategorized/cassandra-load-disparities-with-wan-based-quorum-reads

Summary

PagerDuty encountered unexpected performance issues with their Apache Cassandra deployments across geographically distributed datacenters, specifically noticing read hotspots despite uniform data distribution and access patterns. Initially attributing the uneven load to hardware discrepancies, further investigation revealed that the issue stemmed from the system's tendency to select nodes with the fastest response times for read requests. This led to certain nodes, particularly those in the us-west-1 datacenter, being disproportionately burdened due to their proximity and response efficiency. The challenge was compounded by the inherent round-trip time (RTT) variances in their network topology, resulting in a skewed distribution where nodes A-C received 60% more reads compared to others. Despite considering potential solutions, such as redistributing reads or adjusting network distances, the most viable approach was to scale up the overburdened nodes to manage the load effectively. This experience highlighted the complexities of assumptions in system design and the universal nature of such phenomena in distributed systems that favor low-latency operations.