Python DataFrames (Bodo, Daft, Polars, PySpark, Dask, Modin/Ray) Compete for Your NYC Taxi Fare
Blog post from Bodo
This third installment in the series on Python DataFrames revisits the NYC Taxi benchmark to evaluate Bodo DataFrames, a scalable, high-performance alternative to Pandas that keeps the familiar Pandas API and needs only minimal code changes. Bodo DataFrames combines a C++ backend with the Bodo JIT compiler, and its speed is comparable to using the Bodo JIT compiler directly. In the benchmark it outperforms Daft, Polars, PySpark, Dask, and Modin/Ray by 2x–250x. The library can also process data larger than available memory through streaming and spilling. It scales from single-node to multi-node setups while preserving Pandas idioms, which makes it a practical option for large Pandas workloads and data engineering pipelines that would otherwise require extensive code rewrites.
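To make "preserving Pandas idioms" concrete, here is a minimal sketch of the kind of Pandas-style aggregation such a benchmark might involve. It is not the benchmark code from the post: the data is a tiny synthetic stand-in, and the column names (`pickup_datetime`, `trip_miles`, `fare`) and the helper `summarize_trips` are hypothetical. It runs on plain Pandas; the assumption is that a Pandas-compatible library would need little more than a different import.

```python
# Plain Pandas is used so the sketch runs anywhere. Assumption: a
# Pandas-compatible library would be swapped in by changing this import.
import pandas as pd

# Synthetic stand-in for trip records; column names are hypothetical
# and are not taken from the benchmark dataset.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2019-01-01 08:15:00",
        "2019-01-01 09:30:00",
        "2019-01-01 18:00:00",
        "2019-01-02 08:45:00",
        "2019-01-02 18:05:00",
    ]),
    "trip_miles": [2.0, 3.0, 8.0, 1.0, 12.0],
    "fare": [10.0, 14.0, 30.0, 6.0, 40.0],
})


def summarize_trips(df):
    """Count trips and average the fare per day, split by trip length."""
    enriched = df.assign(
        date=df["pickup_datetime"].dt.date,   # calendar day of pickup
        long_trip=df["trip_miles"] > 5.0,     # flag trips over 5 miles
    )
    return (
        enriched.groupby(["date", "long_trip"], as_index=False)
        .agg(trips=("fare", "count"), avg_fare=("fare", "mean"))
        .sort_values(["date", "long_trip"])
        .reset_index(drop=True)
    )


summary = summarize_trips(trips)
print(summary)
```

The pipeline uses only ordinary Pandas calls (`assign`, `groupby`, named aggregation, `sort_values`), which is the style of code the post says can be kept largely unchanged.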