Andres Marafioti and colleagues have introduced major improvements to the streaming capabilities of the datasets library, enabling efficient loading and training on large-scale datasets without downloading them first. The enhancements cut the number of Hub requests by a factor of 100, speed up data file resolution by 10x, and double streaming throughput, which also reduces crashes during high-concurrency operations.

Key improvements include a persistent data files cache and optimized resolution logic that avoid redundant API calls, plus prefetching for Parquet datasets and configurable buffering that keep the GPU consistently supplied with data. Xet, a deduplication-based storage system, accelerates transfers by skipping duplicate uploads, and customizable streaming pipelines give users more control over data processing. These updates ship in the datasets and huggingface_hub libraries and make streaming roughly as fast as reading from a local SSD, substantially reducing delays in model training workflows.
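To illustrate how a persistent data files cache avoids redundant API calls, here is a minimal, self-contained sketch. It is not the datasets library's actual implementation; the `DataFilesCache` class and `list_files` stub are hypothetical stand-ins for the real resolution logic:

```python
import json
import tempfile
from pathlib import Path

class DataFilesCache:
    """Toy persistent cache: remember the resolved file list for a
    dataset repo on disk, so repeated loads skip the listing API call."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir

    def resolve(self, repo_id: str, list_files) -> list:
        cache_file = self.cache_dir / (repo_id.replace("/", "--") + ".json")
        if cache_file.exists():
            # Cache hit: no network request needed.
            return json.loads(cache_file.read_text())
        files = list_files(repo_id)  # the expensive API call
        cache_file.write_text(json.dumps(files))
        return files

# Usage: a counter stands in for the real listing API.
api_calls = 0

def list_files(repo_id: str) -> list:
    global api_calls
    api_calls += 1
    return ["data/train-00000.parquet", "data/train-00001.parquet"]

cache = DataFilesCache(Path(tempfile.mkdtemp()))
first = cache.resolve("user/dataset", list_files)
second = cache.resolve("user/dataset", list_files)  # served from disk
```

Because the resolved list persists on disk, even separate training runs can reuse it, which is what collapses repeated resolution work into a single request.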
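The idea behind prefetching with a configurable buffer can be sketched in a few lines of pure Python. This is a conceptual illustration, not the library's internals: a background thread fills a bounded queue so that downstream work (such as a GPU training step) overlaps with I/O, and `buffer_size` controls how far ahead the reader may run:

```python
import queue
import threading
from typing import Iterable, Iterator

def prefetch(source: Iterable, buffer_size: int = 8) -> Iterator:
    """Read items from `source` on a background thread into a bounded
    queue, overlapping production (I/O) with consumption (compute)."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        for item in source:
            buf.put(item)  # blocks when the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item

# Usage: wrap any iterable (here a stand-in for a stream of batches).
batches = list(prefetch(range(10), buffer_size=4))
```

The bounded queue is the key design choice: it keeps memory use predictable while still hiding fetch latency, which is why a well-chosen buffer size keeps the GPU from stalling between batches.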
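Deduplication-based storage of the kind Xet provides can be understood with a toy content-addressed store. The sketch below is an assumption-laden simplification (Xet's real chunking and transfer protocol are more sophisticated): data is split into chunks, each chunk is keyed by its hash, and a chunk already present is never transferred again:

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: files are split into fixed-size
    chunks and each unique chunk is transferred and stored only once."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}   # hash -> chunk bytes
        self.uploads = 0   # chunks actually transferred

    def put(self, data: bytes) -> list:
        """Store `data`, returning the list of chunk keys (its 'recipe')."""
        keys = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.chunks:
                self.chunks[key] = chunk  # only new chunks are uploaded
                self.uploads += 1
            keys.append(key)
        return keys

    def get(self, keys: list) -> bytes:
        """Reassemble the original bytes from a list of chunk keys."""
        return b"".join(self.chunks[k] for k in keys)

# Usage: two files sharing a common prefix share the "abcd" chunk.
store = ChunkStore(chunk_size=4)
v1 = store.put(b"abcdefgh")  # two new chunks uploaded
v2 = store.put(b"abcdWXYZ")  # "abcd" is deduplicated; one new chunk
```

Storing only three unique chunks for four logical ones shows why dedupe-based transfer speeds up uploads of datasets whose files share content.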