Company
Date Published
Author
Aaron Gorenstein
Word count
4002
Language
English
Hacker News points
None

Summary

The MongoDB team has released a new online Parquet shredder that improves the performance and efficiency of converting data from MongoDB's native BSON format to the fixed-schema, columnar Parquet format. The shredder uses an improved algorithm that shreds a stream of documents into a columnar format in a single pass while building the schema in parallel. This approach addresses challenges such as missing or new fields, polymorphism, and the maintenance of structural metadata. The shredder also introduces lazy syncing, which lets it maintain the structural metadata efficiently without compromising performance. The team built the shredder on the apache-go parquet writer library and validated its behavior with unit tests, integration tests, and end-to-end tests. The new feature is expected to improve performance for customers who use Parquet with their MongoDB clusters.
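To make the single-pass idea concrete, here is a minimal Python sketch. It is not MongoDB's implementation and does not use any Parquet library. It assumes flat documents (no nesting or arrays), models polymorphism by keying each column on the field name plus the Python type name, and treats "lazy syncing" as padding a column with nulls only when that column is next written or when the stream ends. The function name `shred` and all of these design choices are hypothetical, chosen for illustration only.

```python
def shred(documents):
    """Shred an iterable of flat dicts into columns in a single pass.

    Hypothetical sketch: columns are keyed by (field name, type name), so a
    field that changes type across documents gets one column per type.
    """
    columns = {}   # (field, type name) -> list with one slot per document
    rows_seen = 0  # number of documents fully processed so far

    for doc in documents:
        for field, value in doc.items():
            key = (field, type(value).__name__)
            # A column is created the first time its key is seen, so the
            # schema grows while the data is being shredded.
            col = columns.setdefault(key, [])
            # Lazy catch-up: fill in None for every earlier document that
            # did not carry this column, then record the current value.
            col.extend([None] * (rows_seen - len(col)))
            col.append(value)
        rows_seen += 1

    # Final sync: bring every column up to the full row count.
    for col in columns.values():
        col.extend([None] * (rows_seen - len(col)))
    return columns


docs = [{"a": 1}, {"a": "x", "b": 2.5}, {"b": 3.0}]
for key, values in shred(docs).items():
    print(key, values)
```

In this sketch a column that goes untouched for many documents costs nothing until it is written again, which is one plausible reading of why deferring the sync helps. A real shredder would also have to track nesting and repetition, which this example leaves out.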