
Python data processing engine comparison with NYC Taxi trips (Bodo vs. Spark, Dask, Ray)

Blog post from Bodo

Post Details
Company
Date Published
Author
Ehsan Totoni
Word Count
1,495
Language
English
Hacker News Points
-
Summary

Python has become a popular choice for data engineers and data scientists, but scaling Python code efficiently remains a challenge. Compute engines such as Bodo, Spark, Dask, and Ray/Modin aim to close this gap by scaling Python while preserving performance. A recent benchmark compared these engines on a Python program that computes a summary of monthly trips with precipitation data on the public NYC Taxi dataset. The results show large performance differences: Bodo ran 20x faster than Spark (95% cost savings), 50x faster than Dask (98% cost savings), and 250x faster than Ray/Modin (99% cost savings). The post attributes this to Bodo's HPC-based compiler approach, which differs from the distributed task scheduling design of the other engines. The benchmark ran on a 4-node cluster on AWS; a smaller subset of the dataset was also used so that the program can be run locally on a laptop. In that local setting, Bodo shows a roughly 4x improvement over Pandas, while the other engines can be substantially slower than regular Pandas. The post concludes that Bodo's architecture and design make it a strong competitor to Spark, Dask, and Ray for compute-heavy workloads, citing speed, ease of use, and cost efficiency.
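The cost-savings percentages quoted beside each speedup follow from simple arithmetic, sketched below in Python. This is an illustration and not code from the post: it assumes every engine ran on the same cluster at the same hourly price, so that cost scales directly with runtime, and the helper name `cost_savings` is hypothetical.

```python
def cost_savings(speedup: float) -> float:
    """Fraction of cost saved, assuming cost is proportional to runtime
    on identical hardware (an assumption, not stated in the summary)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Running `speedup` times faster means paying for 1/speedup of the time.
    return 1.0 - 1.0 / speedup


# Speedups quoted in the summary (Bodo relative to each engine).
speedups = {"Spark": 20, "Dask": 50, "Ray/Modin": 250}
savings = {engine: cost_savings(s) for engine, s in speedups.items()}

for engine, frac in savings.items():
    print(f"{engine}: {frac:.1%} lower cost")
# Spark: 95.0% lower cost
# Dask: 98.0% lower cost
# Ray/Modin: 99.6% lower cost
```

Under that assumption the 20x and 50x figures reproduce the quoted 95% and 98% exactly, and 250x gives 99.6%, which the summary reports as 99%.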