Debugging distributed database mysteries with Rust, packet capture and Polars
Blog post from QuestDB
QuestDB, an open-source time-series database built for demanding workloads with ultra-low latency and high throughput, hit a network bandwidth problem in its primary-replica replication feature, which prompted the creation of a custom network profiling tool.

Replication works by compressing Write-Ahead Log (WAL) files and uploading them to an object store, so outbound bandwidth should stay roughly proportional to the inbound ingestion rate. During testing, however, outbound usage was disproportionately high and kept growing.

To diagnose the problem, a capture tool was built with Rust's pcap crate, and the captured traffic was analyzed with Python and Polars. The analysis revealed that the entire transaction metadata was being re-uploaded unnecessarily, causing the excess bandwidth. Distributing the metadata across multiple files, so it could be uploaded incrementally, resolved the problem and left replication more bandwidth-efficient than the initial ingestion itself.

Beyond this fix, the tool helped fine-tune the replication algorithm and informed a replication tuning guide for QuestDB.