Why Your S3 Bill Jumped After You Started Doing Data Engineering

Post Details

Company

Vantage

Date Published

March 21, 2024

Author

Emily Dunenfeld

Word Count

1,009

Language

English

Hacker News Points

-

Source URL

www.vantage.sh/blog/s3-bill-increase-athena-trino-hive-fix-iceberg-caching

Summary

Distributed SQL query engines such as Trino and Amazon Athena run fast, interactive queries over large data sets, but they can significantly increase the Amazon S3 bill because they issue high volumes of GET requests. Athena's own pricing is straightforward, a charge per terabyte of data scanned, but the additional S3 request charges can exceed the cost of storing the data, especially when queries touch many objects. The underlying problem is that the storage layout is not optimized for the way the engine reads it. Apache Iceberg addresses this by improving on the Hive table format: it supports updates and deletes, and it reduces the number of objects and therefore the number of GET requests. Caching with technologies such as Amazon CloudFront or ElastiCache can cut costs further by avoiding repeated requests for the same data, although caching is hard to get right when query patterns are unpredictable. Starburst's Warp Speed, for example, updates its caches dynamically based on workload patterns, significantly reducing costs and improving query performance. Overall, Trino represents a new generation of powerful query engines, and pairing it with newer technologies like Iceberg and a deliberate caching strategy can improve both cost-efficiency and performance in data analytics workflows.
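
To make the cost argument in the summary concrete, the following is a rough back-of-the-envelope sketch, not a calculation taken from the article. The three prices are assumptions based on approximate us-east-1 list prices and may differ by region or change over time. The function name `monthly_costs`, its parameters, and the workload numbers are hypothetical and chosen only for illustration. The `cache_hit_ratio` parameter models a cache that absorbs a fraction of GET requests before they reach S3.

```python
# Assumed prices (approximate us-east-1 list prices; verify against current AWS pricing).
S3_GET_PRICE_PER_1000 = 0.0004   # USD per 1,000 GET requests (assumption)
S3_STORAGE_PRICE_PER_GB = 0.023  # USD per GB-month, S3 Standard (assumption)
ATHENA_PRICE_PER_TB = 5.00       # USD per TB scanned (assumption)


def monthly_costs(stored_gb, queries_per_month, gets_per_query,
                  tb_scanned_per_query, cache_hit_ratio=0.0):
    """Estimate monthly storage, S3 GET request, and Athena scan costs in USD.

    cache_hit_ratio is the fraction of GET requests served by a cache
    instead of S3 (0.0 means no caching).
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")

    # Only cache misses reach S3 and are billed as GET requests.
    billed_gets = queries_per_month * gets_per_query * (1.0 - cache_hit_ratio)

    return {
        "storage": stored_gb * S3_STORAGE_PRICE_PER_GB,
        "get_requests": billed_gets / 1000 * S3_GET_PRICE_PER_1000,
        "athena_scan": queries_per_month * tb_scanned_per_query * ATHENA_PRICE_PER_TB,
    }


if __name__ == "__main__":
    # Hypothetical workload: 1 TB stored as many small objects,
    # 10,000 queries per month, each reading 50,000 objects.
    for ratio in (0.0, 0.9):
        costs = monthly_costs(
            stored_gb=1000,
            queries_per_month=10_000,
            gets_per_query=50_000,
            tb_scanned_per_query=0.01,
            cache_hit_ratio=ratio,
        )
        print(f"cache hit ratio {ratio:.0%}:")
        for name, usd in costs.items():
            print(f"  {name:<13} ${usd:,.2f}")
```

Under these assumed inputs, the request charge (about $200 per month) is several times the storage charge (about $23 per month), which is the pattern the summary describes. Cutting the number of objects read per query, as Iceberg-style consolidation aims to do, or raising the cache hit ratio both shrink only the request line; the storage and scan lines are unaffected in this simplified model.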