Scaling Distributed Joins

Company

SingleStore

Date Published

Dec. 15, 2017

Author

Adam Prout

Word count

1370

Language

English

Hacker News points

None

URL

www.singlestore.com/blog/scaling-distributed-joins

Summary

Distributed joins in SQL databases are a complex and often misunderstood topic. Some people claim that distributed joins don't scale; in practice, different join algorithms involve different amounts of data movement and different trade-offs. There are five types of distributed joins: local/collocated reference table join, local/collocated distributed table join, remote distributed table join, broadcast join, and reshuffle join. The first two types require no data movement and are therefore very scalable, but they depend on schema tuning and are less flexible. The third type, the remote distributed table join, is more flexible than the fourth, the broadcast join, which sends a small subset of rows to every node in the cluster so that the join can then run locally. The fifth type, the reshuffle join, is the most expensive but also the most flexible, because it reshuffles both tables involved in the join. To make distributed joins scale, minimize data movement and choose shard key columns that are commonly joined on. Some queries can also be optimized by restricting the rows involved in the join or by running them as remote distributed joins when possible.
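The difference in data movement between a collocated join, a broadcast join, and a reshuffle join can be illustrated with a small simulation. The following is a minimal sketch, not SingleStore's implementation: the function names (`shard`, `collocated_join`, `broadcast_join`, `reshuffle_join`), the `customers` and `orders` tables, and the three-node cluster are all hypothetical, and each "node" is just a Python list. Each join returns the joined rows together with a count of rows that had to cross the network.

```python
from collections import defaultdict

def shard(rows, key, num_nodes):
    """Hash-partition rows across nodes on the given shard key column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def local_join(left, right, left_key, right_key):
    """Plain hash join executed entirely on one node."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]

def collocated_join(left_nodes, right_nodes, left_key, right_key):
    """Assumes both tables are already sharded on the join columns: no rows move."""
    out = []
    for l, r in zip(left_nodes, right_nodes):
        out.extend(local_join(l, r, left_key, right_key))
    return out, 0

def broadcast_join(left_nodes, right_nodes, left_key, right_key):
    """Copy every row of the (small) right table to every other node."""
    small = [r for node in right_nodes for r in node]
    moved = len(small) * (len(left_nodes) - 1)
    out = []
    for l in left_nodes:
        out.extend(local_join(l, small, left_key, right_key))
    return out, moved

def reshuffle_join(left_nodes, right_nodes, left_key, right_key):
    """Repartition both tables on the join columns, then join locally."""
    n = len(left_nodes)
    moved = 0
    new_left = [[] for _ in range(n)]
    new_right = [[] for _ in range(n)]
    for nodes, key, target in ((left_nodes, left_key, new_left),
                               (right_nodes, right_key, new_right)):
        for src, node in enumerate(nodes):
            for row in node:
                dest = hash(row[key]) % n
                moved += dest != src  # count only rows that change node
                target[dest].append(row)
    out, _ = collocated_join(new_left, new_right, left_key, right_key)
    return out, moved

# Hypothetical sample data: 6 customers, 12 orders, 3 nodes.
NUM_NODES = 3
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(6)]
orders = [{"order_id": 100 + j, "customer_id": j % 6, "amount": 10 * j}
          for j in range(12)]

if __name__ == "__main__":
    cust_nodes = shard(customers, "cust_id", NUM_NODES)
    by_join_key = shard(orders, "customer_id", NUM_NODES)  # tuned shard key
    by_order_id = shard(orders, "order_id", NUM_NODES)     # untuned shard key
    for name, fn, left in (("collocated", collocated_join, by_join_key),
                           ("broadcast", broadcast_join, by_order_id),
                           ("reshuffle", reshuffle_join, by_order_id)):
        rows, moved = fn(left, cust_nodes, "customer_id", "cust_id")
        print(f"{name}: {len(rows)} joined rows, {moved} rows moved")
```

All three strategies produce the same joined rows; only the number of rows moved differs. In this sketch the collocated join moves nothing because `orders` was sharded on the join column up front, which is the kind of schema tuning the summary refers to when it recommends choosing shard keys that are commonly joined on.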