Trino for Large-Scale ETL @ Lyft
Blog post from Starburst
Lyft has integrated Trino into its data platform for large-scale ETL, moving these workloads off Hive and gaining improvements in performance, cost efficiency, and reliability. Trino was initially used at Lyft for smaller ad hoc queries; it now supports the company's large-scale data operations, reading 2.5 PB and writing 60 TB of data per day across 60,000 daily queries. The transition brought challenges, including the noisy neighbor problem and frequent releases, but it cut ETL runtimes substantially, with some jobs dropping from five hours to 20 minutes. Migrating to AWS Graviton instances reduced hosting costs by 10%, and autoscaling improved ETL efficiency by adjusting cluster capacity dynamically based on demand. Future plans include improving reliability, exploring sharding at the orchestration layer, and enabling fault-tolerant execution, potentially by adopting Tardigrade, Trino's fault-tolerant execution project. The team reports being satisfied with Trino and remains committed to optimizing its data infrastructure around it.
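To put the headline figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted above and assumes decimal units (1 PB = 1,000 TB, 1 TB = 1,000 GB). The per-query values are simple means; the summary does not describe how data volume is actually distributed across queries, and real workloads are likely skewed, so treat these as illustrative rather than as figures reported by Lyft.

```python
# Figures quoted in the summary; decimal units are an assumption.
DAILY_READ_PB = 2.5
DAILY_WRITE_TB = 60
DAILY_QUERIES = 60_000

TB_PER_PB = 1_000
GB_PER_TB = 1_000


def average_gb_per_query(total_tb: float, queries: int) -> float:
    """Mean data volume per query in GB, given a daily total in TB."""
    return total_tb * GB_PER_TB / queries


avg_read_gb = average_gb_per_query(DAILY_READ_PB * TB_PER_PB, DAILY_QUERIES)
avg_write_gb = average_gb_per_query(DAILY_WRITE_TB, DAILY_QUERIES)

# Best-case runtime improvement mentioned: five hours down to 20 minutes.
speedup = (5 * 60) / 20

print(f"Average read per query:  {avg_read_gb:.1f} GB")
print(f"Average write per query: {avg_write_gb:.1f} GB")
print(f"Best-case speedup:       {speedup:.0f}x")
```

On these assumptions, the average query reads roughly 42 GB and writes about 1 GB, and the best-case job mentioned runs about 15 times faster than before.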