The document model has proven to be a powerful paradigm for modern application schemas, offering greater expressiveness than traditional tabular and relational representations. Machine learning algorithms, however, have struggled with semi-structured formats like JSON, whose flexible schemas accommodate dynamic and nested data structures. To bridge this gap, MongoDB's ML research group developed ORiGAMi, a novel Transformer-based architecture for supervised learning on semi-structured data. It makes predictions directly from semi-structured documents, without the cumbersome flattening and manual feature extraction that a tabular representation requires.

ORiGAMi uses a tokenization strategy that transforms documents into sequences of key-value pairs and special structural tokens that encode nested types, allowing it to predict any field within a document, including complex types such as arrays and nested subdocuments. The architecture includes guardrails to ensure the model generates only valid, well-formed documents, along with a novel position-encoding strategy that respects the order invariance of key/value pairs in JSON.

ORiGAMi can be trained on as few as 200 labeled samples and can predict a field such as "user_segment" for new users immediately after signup, without rebuilding feature pipelines. The architecture is now open source, allowing developers to explore its capabilities and contribute to its development.
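To make the tokenization idea concrete, here is a minimal sketch of how a nested document could be serialized into a flat sequence of key, value, and structural tokens. The token names (`<doc>`, `<arr>`, `key:`, `val:`) are hypothetical for illustration; ORiGAMi's actual vocabulary and encoding details live in the open-sourced code.

```python
from typing import Any, List

# Hypothetical structural tokens -- ORiGAMi's real vocabulary may differ.
DOC_START, DOC_END = "<doc>", "</doc>"
ARR_START, ARR_END = "<arr>", "</arr>"

def tokenize(value: Any) -> List[str]:
    """Serialize a (possibly nested) document into a flat token sequence."""
    if isinstance(value, dict):
        tokens = [DOC_START]
        for key, val in value.items():  # key/value pairs; JSON treats their order as insignificant
            tokens.append(f"key:{key}")
            tokens.extend(tokenize(val))
        tokens.append(DOC_END)
        return tokens
    if isinstance(value, list):
        # Arrays are wrapped in their own structural tokens so nesting is recoverable.
        return [ARR_START] + [t for item in value for t in tokenize(item)] + [ARR_END]
    return [f"val:{value}"]  # leaf value becomes a single value token

doc = {"name": "Ada", "tags": ["ml", "json"], "profile": {"age": 36}}
print(tokenize(doc))
```

Because the structural tokens bracket every subdocument and array, a decoder constrained to emit matching open/close tokens can only produce well-formed documents, which is the role the guardrails play during generation.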