Using Storage Buckets as a Working Layer for Data Pipelines
Blog post from Hugging Face
Daniel van Strien describes moving a data pipeline to Storage Buckets, with HF Jobs handling scheduling, and explains the advantages over the previous setup. The old pipeline ran on GitHub Actions and was inefficient because it repeatedly had to download, merge, and re-upload large datasets. The new system uses mutable, non-versioned Storage Buckets powered by Xet, which deduplicates data so that uploads are faster and incremental. The approach suits pipelines that collect data incrementally and process it before publishing: each stage only writes forward and never modifies earlier data. Fetch jobs append data to the bucket, and a compile job processes it and publishes the result to a versioned repository. This keeps the pipeline fault tolerant and allows the dataset to be regenerated if needed. Scheduling is handled by HF Jobs, which offers flexibility and secure handling of secrets, uses resources efficiently, and makes the pipeline easy to update.
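The write-forward pattern summarized above can be illustrated with a minimal sketch. This is not the author's code and it does not call any Hugging Face API: a local directory stands in for the Storage Bucket, a second directory stands in for the versioned repository, and the names `fetch_job`, `compile_job`, `demo`, the `raw/` layout, and the `id` field are all hypothetical choices made for illustration.

```python
import json
import tempfile
from pathlib import Path


def fetch_job(bucket: Path, batch_id: str, records: list[dict]) -> Path:
    """Append one batch as a new file; earlier batches are never rewritten.

    `bucket` is a local directory standing in for a Storage Bucket (assumption).
    """
    raw_dir = bucket / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    out = raw_dir / f"{batch_id}.jsonl"
    if out.exists():
        # Re-running a fetch for an existing batch is a no-op.
        return out
    with out.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return out


def compile_job(bucket: Path, publish_dir: Path) -> list[dict]:
    """Rebuild the full dataset from every raw batch and write one snapshot.

    `publish_dir` stands in for the versioned repository (assumption).
    """
    merged: dict[str, dict] = {}
    for path in sorted((bucket / "raw").glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                # Later batches win when the hypothetical `id` key repeats.
                merged[rec["id"]] = rec
    rows = [merged[key] for key in sorted(merged)]
    publish_dir.mkdir(parents=True, exist_ok=True)
    with (publish_dir / "dataset.jsonl").open("w", encoding="utf-8") as f:
        for rec in rows:
            f.write(json.dumps(rec) + "\n")
    return rows


def demo() -> list[dict]:
    """Run two fetches and one compile against temporary directories."""
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp) / "bucket"
        published = Path(tmp) / "published"
        fetch_job(bucket, "2024-01-01", [{"id": "a", "v": 1}, {"id": "b", "v": 1}])
        fetch_job(bucket, "2024-01-02", [{"id": "b", "v": 2}, {"id": "c", "v": 1}])
        return compile_job(bucket, published)
```

Because the compile step reads only from the raw batches and never edits them, the published snapshot can be deleted and rebuilt at any time, which is the property the summary refers to as regeneration.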