
How We Scaled Data Quality at Galileo

Blog post from Galileo

Post Details
Company: Galileo
Date Published:
Author: Ben Epstein
Word Count: 4,324
Language: English
Hacker News Points: -
Summary

At Galileo, the team set out to let machine learning engineers surface critical issues in their datasets quickly. The sheer size of that data posed a challenge that other, model-centric ML tools did not have to solve. They first explored common workarounds for out-of-core data handling, such as pandas with chunking and pyarrow, but these proved either insufficient or too slow for their needs.

The team then turned to Vaex, a lesser-known project that offers an out-of-core hybrid Apache Arrow/NumPy DataFrame for big tabular data. Vaex delivered the scalability, performance, and flexibility to handle massive datasets with low latency. By leveraging its memory-mappable file formats, chunking, and caching, the team could efficiently process large volumes of data, including single columns of high-dimensional vectors, and apply custom NumPy expressions through the `register_function` decorator. On high-dimensional data, Vaex significantly outperformed the other tools they evaluated, making it an ideal choice for their platform.
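For context on the chunking workaround the team ruled out, here is a minimal sketch of out-of-core aggregation with pandas, where per-chunk statistics must be combined by hand; the file name and column are illustrative, not Galileo's actual data:

```python
import pandas as pd

# Stream the file in fixed-size chunks so only one chunk lives in
# memory at a time; per-chunk results must be merged manually.
total, rows = 0.0, 0
for chunk in pd.read_csv("train_data.csv", chunksize=1_000_000):
    total += chunk["confidence"].sum()
    rows += len(chunk)

print("mean confidence:", total / rows)
```

This works for simple aggregates, but every new statistic needs its own merge logic, and repeated full passes over a large file quickly become the bottleneck.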
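The memory-mapping workflow the summary describes looks roughly like this in Vaex's public API; the file path and column names are stand-ins, not Galileo's schema:

```python
import vaex

# vaex.open memory-maps an HDF5/Arrow file: opening is near-instant
# and the dataset never needs to fit in RAM.
df = vaex.open("dataset.hdf5")

# Column arithmetic creates lazy "virtual columns"; nothing is
# materialized until an aggregation forces evaluation.
df["confidence_pct"] = df.confidence * 100

# Filters are lazy views over the memory-mapped data.
low_confidence = df[df.confidence < 0.1]
print(low_confidence.count(), df.mean(df.confidence))
```

Because expressions and filters stay lazy, Vaex evaluates them chunk by chunk over the memory-mapped file, which is what keeps latency low even when the dataset dwarfs available memory.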
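The `register_function` decorator mentioned above lets a plain NumPy function participate in that same lazy, chunked evaluation. A small self-contained sketch, with a made-up function and column names for illustration:

```python
import numpy as np
import vaex

# Registering a NumPy function makes it usable inside Vaex
# expressions, evaluated chunk by chunk during computation.
@vaex.register_function()
def clip_confidence(x, lo, hi):
    return np.clip(x, lo, hi)

df = vaex.from_arrays(confidence=np.random.uniform(-0.2, 1.2, 1_000))
df["clipped"] = df.func.clip_confidence(df.confidence, 0.0, 1.0)
print(df.mean(df.clipped))
```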