From compute hours to data moved: a benchmark series
Blog post from dltHub
The post examines how efficiently dltHub moves data by asking how much data one compute hour moves under each of the bottlenecks a pipeline can hit. The first benchmark covers SQL copy operations: under optimal conditions, with source and destination co-located in the same region, dltHub can move up to 65 GB, or approximately 350 million rows, of Postgres data to BigQuery in one hour. Further benchmarks are planned for REST APIs, JSON files, and Parquet files, so that the series covers each bottleneck in turn. According to the post, most dltHub pipelines are constrained by rate limits when reading from REST APIs and by the high CPU cost of schema inference when loading JSON files. The post also gives cost estimates for typical data operations, which suggest that monthly data movement costs are generally modest. A trial version of dltHub is available for readers who want to test its capabilities.
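To make the headline figure easier to reason about, the following minimal sketch converts "65 GB or about 350 million rows per hour" into per-second rates and an average row size, then scales it to an arbitrary volume. It is an illustration, not part of the benchmark: it assumes decimal gigabytes (the unit convention is not stated here), the function names are hypothetical, and `price_per_compute_hour` is a placeholder you supply yourself, since no hourly price is quoted here.

```python
# Figures quoted for the co-located Postgres -> BigQuery SQL copy benchmark.
GB_PER_HOUR = 65
ROWS_PER_HOUR = 350_000_000

# Assumption: 1 GB = 10**9 bytes. The unit convention is not stated in the text.
BYTES_PER_GB = 10**9
SECONDS_PER_HOUR = 3600


def derived_rates():
    """Return (MB per second, rows per second, average bytes per row)."""
    bytes_per_hour = GB_PER_HOUR * BYTES_PER_GB
    mb_per_second = bytes_per_hour / SECONDS_PER_HOUR / 10**6
    rows_per_second = ROWS_PER_HOUR / SECONDS_PER_HOUR
    bytes_per_row = bytes_per_hour / ROWS_PER_HOUR
    return mb_per_second, rows_per_second, bytes_per_row


def estimate(volume_gb, price_per_compute_hour):
    """Return (compute hours, cost) to move volume_gb at the benchmark rate.

    price_per_compute_hour is a hypothetical input, not a quoted price.
    The estimate only holds if your pipeline matches the optimal conditions
    of the benchmark (SQL copy, same-region source and destination).
    """
    hours = volume_gb / GB_PER_HOUR
    return hours, hours * price_per_compute_hour


if __name__ == "__main__":
    mb_s, rows_s, row_bytes = derived_rates()
    print(f"{mb_s:.2f} MB/s, {rows_s:.0f} rows/s, {row_bytes:.0f} bytes/row")
    hours, cost = estimate(volume_gb=130, price_per_compute_hour=0.10)
    print(f"{hours:.1f} compute hours, cost {cost:.2f}")
```

Under these assumptions the benchmark works out to roughly 18 MB/s, about 97,000 rows per second, and an average row of about 186 bytes. Treat the result as an upper bound: pipelines limited by REST API rate limits or JSON schema inference will move less data per compute hour.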