
Teaching DataFusion to read struct fields efficiently

Blog post from Pydantic

Post Details
Company: Pydantic
Date Published: -
Author: -
Word Count: 1,468
Language: English
Hacker News Points: -
Summary

Matthew, who works on Fusionfire, the database behind the Pydantic Logfire observability platform, describes the problems with storing logs as JSON columns, which are inefficient because of parsing and validation overhead. To address this, he has been implementing support in DataFusion for the Variant type, an efficient binary encoding for JSON-like data. Adopting Variant standardizes the shredding specification and should improve interoperability and performance. The migration to Variant is planned for the next quarter, and it will allow JSON data to be stored in Parquet as typed struct columns.

The article focuses on how DataFusion queries these struct columns. DataFusion now supports struct field pushdown, so a query operates directly on the leaf columns it needs rather than treating the entire struct as an opaque unit. This enables projection pruning, filter pushdown, and row group pruning, all of which were previously inefficient for struct fields. The results show substantial performance gains, especially when querying wide structs with large sibling columns, because data the query does not need is no longer processed. The article presents this as a significant improvement in handling semi-structured data in the Parquet ecosystem as Variant becomes a first-class type.
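To make the three optimizations concrete, here is a minimal toy model in plain Python. It is not DataFusion's or Parquet's implementation: the `RowGroup` class, the `scan_eq` function, and the `attrs.*` column names are all hypothetical and exist only for illustration. The sketch assumes each struct leaf is stored as its own column with per-row-group min/max statistics, which is the layout that lets a reader skip row groups and sibling columns.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy row group: each struct leaf is its own column, keyed by dotted path."""
    columns: dict
    stats: dict = field(init=False)
    reads: list = field(init=False, default_factory=list)

    def __post_init__(self):
        # Min/max per leaf, standing in for metadata kept apart from the data.
        self.stats = {p: (min(v), max(v)) for p, v in self.columns.items()}

    def read(self, path):
        # Record which leaf columns were actually decoded.
        self.reads.append(path)
        return self.columns[path]


def scan_eq(row_groups, filter_path, wanted, project_path):
    """Return project_path values for rows where filter_path == wanted."""
    out = []
    for rg in row_groups:
        lo, hi = rg.stats[filter_path]
        if not (lo <= wanted <= hi):
            continue  # row group pruning: statistics rule out any match
        # Filter pushdown: evaluate the predicate on the one leaf it needs.
        mask = [v == wanted for v in rg.read(filter_path)]
        if not any(mask):
            continue
        # Projection pruning: decode only the requested leaf, not its siblings.
        projected = rg.read(project_path)
        out.extend(v for v, keep in zip(projected, mask) if keep)
    return out


row_groups = [
    RowGroup({
        "attrs.status": [200, 200, 204],
        "attrs.route": ["/a", "/b", "/c"],
        "attrs.body": ["x" * 1000] * 3,  # large sibling column
    }),
    RowGroup({
        "attrs.status": [404, 500, 500],
        "attrs.route": ["/d", "/e", "/f"],
        "attrs.body": ["y" * 1000] * 3,
    }),
]

result = scan_eq(row_groups, "attrs.status", 500, "attrs.route")
print(result)
print([rg.reads for rg in row_groups])
```

In this toy run the first row group is skipped from its statistics alone, and the large `attrs.body` sibling is never read in either group. That is the kind of saving the summary attributes to wide structs with large sibling columns.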