Rethinking Table File Paths with Uber: Lance’s Multi-Base Layout
Blog post from LanceDB
Table formats such as Iceberg, Delta Lake, and Lance each define how a table refers to its data files, and that path-management strategy strongly affects how portable a dataset is and how complex it is to operate at scale. Iceberg originally used absolute paths for file references and is transitioning to relative paths, so that a table can be relocated without rewriting paths. Delta Lake took the opposite route: it began with relative paths to get zero-rewrite portability, then added absolute paths to accommodate features like shallow cloning. Lance prioritizes predictability and strict portability. It uses a fixed directory structure and exclusively relative paths, so a dataset can be copied without modifying its metadata.

Lance's multi-base path model extends this design so that a single dataset can span multiple storage locations while remaining portable. Uber's AI infrastructure team, which needed to distribute datasets across multiple S3 buckets, demonstrates the model in practice. Base paths are declared explicitly in the manifest, which simplifies operational tasks such as garbage collection and credential management. The model supports use cases including multi-region data distribution, efficient disaster recovery, and AI experimentation workflows.
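To make the multi-base idea concrete, here is a minimal Python sketch of how a reader might resolve file locations when base paths are listed explicitly in the manifest. It is an illustration under stated assumptions, not Lance's actual API or manifest schema: the names `Manifest`, `DataFile`, `base_id`, `resolve`, and `storage_roots`, as well as the bucket names, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataFile:
    """A file reference: always a relative path, optionally tied to a base."""
    relative_path: str
    base_id: Optional[int] = None  # None means "relative to the dataset root"


@dataclass
class Manifest:
    """Hypothetical manifest: extra storage locations are declared up front."""
    base_paths: Dict[int, str] = field(default_factory=dict)


def resolve(dataset_root: str, manifest: Manifest, data_file: DataFile) -> str:
    """Join a relative file path onto the dataset root or a declared base."""
    if data_file.base_id is None:
        base = dataset_root
    else:
        # Fails loudly (KeyError) if a file points at an undeclared base.
        base = manifest.base_paths[data_file.base_id]
    return base.rstrip("/") + "/" + data_file.relative_path.lstrip("/")


def storage_roots(dataset_root: str, manifest: Manifest) -> List[str]:
    """Every location a cleanup or credential check would need to visit."""
    return [dataset_root] + [manifest.base_paths[k] for k in sorted(manifest.base_paths)]


# Example: one dataset whose files live in two (made-up) S3 buckets.
root = "s3://bucket-a/tables/events.lance"
manifest = Manifest(base_paths={1: "s3://bucket-b/events-overflow"})
local_file = DataFile("data/0001.lance")
remote_file = DataFile("data/0002.lance", base_id=1)

print(resolve(root, manifest, local_file))
print(resolve(root, manifest, remote_file))

# Copying the dataset to a new root changes only the root argument;
# the file entries themselves are untouched.
moved_root = "s3://bucket-c/archive/events.lance"
print(resolve(moved_root, manifest, local_file))
```

Because the set of bases is enumerated in one place in this sketch, a garbage-collection or credential-setup pass can iterate over `storage_roots(...)` instead of scanning every file entry to discover which buckets are in use.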