Andres Marafioti and colleagues have introduced major improvements to the streaming capabilities of the datasets library, enabling efficient loading and training on large-scale datasets without downloading them first. The enhancements cut the number of Hub requests by a factor of 100, speed up data file resolution by 10x, and double streaming throughput, which also reduces crashes during high-concurrency operations.

Key improvements include a persistent data files cache and optimized resolution logic that avoid redundant API calls, plus prefetching for Parquet datasets and configurable buffering that keep the GPU consistently supplied with data. Xet, a deduplication-based storage system, accelerates transfers by skipping duplicate uploads, and customizable streaming pipelines give users more control over data processing. These updates ship in the datasets and huggingface_hub libraries and make streaming roughly as fast as reading from a local SSD, substantially reducing delays in model training workflows.
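To illustrate how a persistent data files cache avoids redundant API calls, here is a minimal, self-contained sketch. It is not the datasets library's actual implementation; the `DataFilesCache` class and `list_files` stub are hypothetical stand-ins for the real resolution logic:

```python
import json
import tempfile
from pathlib import Path

class DataFilesCache:
    """Toy persistent cache: remember the resolved file list for a
    dataset repo on disk, so repeated loads skip the listing API call."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir

    def resolve(self, repo_id: str, list_files) -> list:
        cache_file = self.cache_dir / (repo_id.replace("/", "--") + ".json")
        if cache_file.exists():
            # Cache hit: no network request needed.
            return json.loads(cache_file.read_text())
        files = list_files(repo_id)  # the expensive API call
        cache_file.write_text(json.dumps(files))
        return files

# Usage: a counter stands in for the real listing API.
api_calls = 0

def list_files(repo_id: str) -> list:
    global api_calls
    api_calls += 1
    return ["data/train-00000.parquet", "data/train-00001.parquet"]

cache = DataFilesCache(Path(tempfile.mkdtemp()))
first = cache.resolve("user/dataset", list_files)
second = cache.resolve("user/dataset", list_files)  # served from disk
```

Because the resolved list persists on disk, even separate training runs can reuse it, which is what collapses repeated resolution work into a single request.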
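The idea behind prefetching with a configurable buffer can be sketched in a few lines of pure Python. This is a conceptual illustration, not the library's internals: a background thread fills a bounded queue so that downstream work (such as a GPU training step) overlaps with I/O, and `buffer_size` controls how far ahead the reader may run:

```python
import queue
import threading
from typing import Iterable, Iterator

def prefetch(source: Iterable, buffer_size: int = 8) -> Iterator:
    """Read items from `source` on a background thread into a bounded
    queue, overlapping production (I/O) with consumption (compute)."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        for item in source:
            buf.put(item)  # blocks when the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item

# Usage: wrap any iterable (here a stand-in for a stream of batches).
batches = list(prefetch(range(10), buffer_size=4))
```

The bounded queue is the key design choice: it keeps memory use predictable while still hiding fetch latency, which is why a well-chosen buffer size keeps the GPU from stalling between batches.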
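Deduplication-based storage of the kind Xet provides can be understood with a toy content-addressed store. The sketch below is an assumption-laden simplification (Xet's real chunking and transfer protocol are more sophisticated): data is split into chunks, each chunk is keyed by its hash, and a chunk already present is never transferred again:

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: files are split into fixed-size
    chunks and each unique chunk is transferred and stored only once."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}   # hash -> chunk bytes
        self.uploads = 0   # chunks actually transferred

    def put(self, data: bytes) -> list:
        """Store `data`, returning the list of chunk keys (its 'recipe')."""
        keys = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.chunks:
                self.chunks[key] = chunk  # only new chunks are uploaded
                self.uploads += 1
            keys.append(key)
        return keys

    def get(self, keys: list) -> bytes:
        """Reassemble the original bytes from a list of chunk keys."""
        return b"".join(self.chunks[k] for k in keys)

# Usage: two files sharing a common prefix share the "abcd" chunk.
store = ChunkStore(chunk_size=4)
v1 = store.put(b"abcdefgh")  # two new chunks uploaded
v2 = store.put(b"abcdWXYZ")  # "abcd" is deduplicated; one new chunk
```

Storing only three unique chunks for four logical ones shows why dedupe-based transfer speeds up uploads of datasets whose files share content.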