Python data processing engine comparison with NYC Taxi trips (Bodo vs. Spark, Dask, Ray)

Post Details

Company

Bodo

Date Published

Jan. 20, 2025

Author

Ehsan Totoni

Word Count

1,495

Language

English

Hacker News Points

-

Source URL

www.bodo.ai/blog/python-data-processing-engine-comparison-with-nyc-taxi-trips-bodo-vs-spark-dask-ray

Summary

Python has become a popular choice for data engineers and data scientists, but scaling Python code efficiently remains a challenge. Compute engines such as Bodo, Spark, Dask, and Ray/Modin aim to bridge this gap by scaling Python workloads while striving for high performance. A recent benchmark compared these engines on a Python program that computes a summary of monthly trips with precipitation data on the public NYC Taxi dataset. The results show large performance differences: Bodo delivered a 20x speedup over Spark (95% cost savings), 50x over Dask (98% cost savings), and 250x over Ray/Modin (99% cost savings). The post attributes this to Bodo's HPC-based compiler approach, which differs from the distributed task scheduling design of the other engines. The benchmark was conducted on a 4-node cluster on AWS. A smaller subset of the dataset was also used to allow local execution on a laptop, where Bodo shows a roughly 4x improvement over Pandas, while the other engines can be substantially slower than regular Pandas. The post concludes that Bodo's architecture and design make it a strong competitor to Spark, Dask, and Ray for compute-heavy workloads, with advantages in speed, ease of use, and cost efficiency.
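
The cost-savings percentages quoted alongside each speedup can be checked with a short calculation. The sketch below assumes that cost is proportional to runtime on the same cluster at the same hourly price. The summary does not state that assumption, and the helper name `cost_savings` is hypothetical.

```python
def cost_savings(speedup: float) -> float:
    """Return the fraction of cost saved when a job runs `speedup` times faster.

    Assumption (not stated in the summary): cost scales linearly with
    runtime on identical hardware, so savings = 1 - 1/speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedup factors as reported in the summary.
reported_speedups = {"Spark": 20, "Dask": 50, "Ray/Modin": 250}

for engine, speedup in reported_speedups.items():
    print(f"Bodo vs. {engine}: {speedup}x faster, "
          f"about {cost_savings(speedup):.1%} lower cost")
```

Under this assumption the three speedups give 95.0%, 98.0%, and 99.6%, which agrees with the 95% and 98% figures and with the 99% figure if that one is rounded down.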