Implementing an Online Parquet Shredder

Blog post from MongoDB

Post Details
Company: -
Date Published: -
Author: -
Word Count: 5,743
Language: English
Hacker News Points: -
Summary

Aaron Gorenstein describes significant improvements to the BSON-to-Parquet conversion process in Atlas Data Federation, a feature that lets customers export data from an Atlas cluster to their own blob storage as a Parquet file. The new approach, introduced in February 2023, uses an efficient "Online Shredder" algorithm that converts documents in a single pass. It reduces CPU and memory usage and doubles throughput, with no performance drawbacks for customers. The main challenge is reconciling MongoDB's flexible schema with Parquet's fixed schema, which the post addresses through its schema-building and document-shredding techniques, including lazy syncing of definition levels. The implementation was a collaboration among experts on several teams, and the resulting output remains compatible with third-party Parquet readers.
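To make the idea of single-pass shredding with lazily synced definition levels more concrete, here is a minimal Python sketch. It is not the implementation described in the post: it assumes flat documents with optional top-level fields only (no nesting, no arrays, no type conflicts), and the names `Column`, `OnlineShredder`, `sync`, `add`, and `finish` are hypothetical. A definition level of 1 means the value is present in that row, and 0 means it is missing or null.

```python
from typing import Any, Dict, List


class Column:
    """One flat, optional Parquet-style column (hypothetical helper)."""

    def __init__(self) -> None:
        self.values: List[Any] = []      # only non-null values are stored
        self.def_levels: List[int] = []  # 1 = value present, 0 = missing/null

    def sync(self, row_count: int) -> None:
        # Lazily pad with 0s for every row seen since this column was last
        # touched, instead of writing a 0 into every column on every row.
        self.def_levels.extend([0] * (row_count - len(self.def_levels)))


class OnlineShredder:
    """Single-pass shredder sketch: the schema grows as new fields appear."""

    def __init__(self) -> None:
        self.columns: Dict[str, Column] = {}
        self.rows = 0

    def add(self, doc: Dict[str, Any]) -> None:
        for name, value in doc.items():
            # A field seen for the first time creates its column on the spot.
            col = self.columns.setdefault(name, Column())
            col.sync(self.rows)  # catch up on rows where the field was absent
            if value is None:
                col.def_levels.append(0)
            else:
                col.values.append(value)
                col.def_levels.append(1)
        self.rows += 1

    def finish(self) -> Dict[str, Column]:
        # Final sync so every column has one definition level per row.
        for col in self.columns.values():
            col.sync(self.rows)
        return self.columns


shredder = OnlineShredder()
for doc in [{"a": 1}, {"b": "x"}, {"a": 2, "b": None}]:
    shredder.add(doc)
columns = shredder.finish()
```

In this sketch a column is only touched when its field actually appears in a document, so a document with few fields costs work proportional to its own size rather than to the total number of columns seen so far. A real shredder would also need repetition levels for arrays and a policy for fields whose type changes between documents.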