Trino for Large-Scale ETL @ Lyft

Post Details

Company

Starburst

Date Published

Jan. 25, 2023

Author

Charles Song

Word Count

876

Company Posts That Month

9

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.starburst.io/blog/trino-for-large-scale-etl-lyft

Summary

Lyft has integrated Trino into its data platform for large-scale ETL, moving off Hive to gain performance, cost-efficiency, and reliability. Initially used only for smaller ad hoc queries, Trino now supports Lyft's core data operations, reading 2.5 PB and writing 60 TB of data per day across 60,000 daily queries. The transition brought challenges, including the noisy neighbor problem and frequent releases, but it cut ETL runtimes substantially, with some jobs dropping from five hours to 20 minutes. Migrating to AWS Graviton instances reduced hosting costs by 10%, and autoscaling improved ETL efficiency by adjusting cluster capacity dynamically based on demand. Future plans include improving reliability, exploring sharding at the orchestration layer, enabling fault-tolerant execution, and potentially adopting Tardigrade. The team is satisfied with the results and remains committed to optimizing its data infrastructure with Trino.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Data Pipeline	20	475	100	40	-27%
Platform Engineering	3	224	44	29	+59%
