Inflated data lakehouse costs and latencies? - Blame S3's choice of HTTP/1.1
Blog post from Onehouse
The performance of cloud object storage platforms like Amazon S3 and Google Cloud Storage (GCS) depends heavily on the HTTP protocol each one uses: S3 relies on HTTP/1.1, while GCS uses HTTP/2. HTTP/1.1's limitations, notably head-of-line blocking and higher request latency, translate into inefficiency and higher cost, and the post reports S3 showing up to 15 times higher latency than GCS in practical workloads. The gap arises because HTTP/1.1 lacks HTTP/2's multiplexing and header compression, so clients need many parallel TCP connections to achieve concurrency, which raises TCP overhead and makes performance vary from one software development kit (SDK) to another. Onehouse addresses these challenges with optimizations such as byte-range coalescing and smart concurrency management, which improve cost efficiency and performance in data lake operations. The shift from distributed file systems to object storage has made HTTP behavior a critical factor in running data lakes, so managing the protocol well is necessary to reduce compute costs and improve throughput. Onehouse's lakehouse platform builds these insights into its connection management and protocol handling to deliver better performance and cost savings for users.
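To make byte-range coalescing concrete, here is a minimal Python sketch of the general idea: nearby byte ranges of the same object are merged so that fewer HTTP range requests are issued, at the cost of reading a few unneeded bytes. The function names, the `max_gap` parameter, and the 64-byte threshold in the usage example are hypothetical and chosen for illustration; this is not Onehouse's implementation.

```python
from typing import List, Tuple

# A byte range as (start, end) offsets into one object, end exclusive.
Range = Tuple[int, int]


def coalesce_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge ranges that overlap or sit at most max_gap bytes apart.

    Fewer, larger ranges mean fewer HTTP requests, which matters most
    when each request occupies a connection for its full duration.
    """
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def to_range_header(byte_range: Range) -> str:
    """Format a range as an HTTP Range header value (inclusive end)."""
    start, end = byte_range
    return f"bytes={start}-{end - 1}"


if __name__ == "__main__":
    # Three reads against one object; the first two are 50 bytes apart.
    wanted = [(0, 100), (150, 300), (5000, 6000)]
    for r in coalesce_ranges(wanted, max_gap=64):
        print(to_range_header(r))
```

With a 64-byte gap threshold, the three requested reads collapse into two range requests. The right threshold is a trade-off between per-request overhead and wasted bytes, and it would need tuning per workload.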