
Why Your S3 Bill Jumped After You Started Doing Data Engineering

Blog post from Vantage

Post Details
Company: Vantage
Date Published:
Author: Emily Dunenfeld
Word Count: 1,009
Language: English
Hacker News Points: -
Summary

Distributed SQL query engines such as Trino and Amazon Athena make it possible to run fast, interactive queries on large data sets, but they can significantly increase Amazon S3 costs because they issue high volumes of GET requests. Athena's own pricing is straightforward, a charge per terabyte of data scanned, yet the S3 request charges that queries generate are billed separately and can exceed the cost of storage itself, especially when a query touches many objects. The underlying issue is that the way data is laid out in storage is not optimized for the way it is processed. Apache Iceberg, a table format that improves on the Hive table format, addresses this: it supports updates and deletes and can reduce the number of objects, and therefore GET requests, that a query reads. Caching with technologies such as Amazon CloudFront or ElastiCache can cut costs further by avoiding repeated requests for the same data, although caching is harder to get right when query patterns are unpredictable. Starburst's Warp Speed, for example, dynamically updates its caches based on workload patterns, reducing costs and improving query performance. Overall, Trino represents a new generation of powerful query engines, and pairing it with Iceberg and strategic caching can improve the cost-efficiency and performance of data analytics workflows.
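
To make the request-cost argument concrete, here is a rough back-of-the-envelope sketch in Python. The prices are assumptions based on commonly published us-east-1 list prices (USD 0.0004 per 1,000 S3 GET requests, USD 5 per TB scanned by Athena) and should be checked against current AWS pricing. The function names, the number of GETs per object, and the workload figures are hypothetical and do not come from the original post.

```python
# Illustrative list prices (assumptions; verify against current AWS pricing).
S3_GET_PRICE_PER_1000 = 0.0004  # USD per 1,000 S3 GET requests
ATHENA_PRICE_PER_TB = 5.00      # USD per TB scanned by Athena


def s3_get_cost(objects_per_query: int, gets_per_object: int, queries: int) -> float:
    """Estimate the S3 GET charges generated by a query workload.

    gets_per_object is a hypothetical knob: engines often issue several
    ranged GETs per file (for example footer, then column chunks).
    """
    requests = objects_per_query * gets_per_object * queries
    return requests / 1000 * S3_GET_PRICE_PER_1000


def athena_scan_cost(tb_scanned_per_query: float, queries: int) -> float:
    """Estimate the Athena per-terabyte scan charges for the same workload."""
    return tb_scanned_per_query * queries * ATHENA_PRICE_PER_TB


if __name__ == "__main__":
    queries = 1_000  # hypothetical number of queries per month

    # Same data volume, stored as many small objects vs. fewer compacted ones.
    many_small = s3_get_cost(objects_per_query=1_000_000, gets_per_object=3, queries=queries)
    compacted = s3_get_cost(objects_per_query=10_000, gets_per_object=3, queries=queries)
    scan = athena_scan_cost(tb_scanned_per_query=0.1, queries=queries)

    print(f"GET cost, many small objects: ${many_small:,.2f}")
    print(f"GET cost, compacted objects:  ${compacted:,.2f}")
    print(f"Athena scan cost:             ${scan:,.2f}")
```

Under these assumed numbers, the small-object layout generates roughly $1,200 in GET charges against about $12 for the compacted layout, while the scan charge is identical in both cases. This is the effect the summary attributes to reducing object counts, shown here only as arithmetic rather than as a measurement.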