Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket
Blog post from Google Cloud
Google Cloud has announced a significant advancement for AI/ML workloads in the PyTorch ecosystem: Rapid Storage, powered by Google's Colossus storage architecture, is now accessible through the fsspec interface via GCSFS. The integration targets the data-loading and checkpointing bottlenecks that grow with model size, keeping GPUs efficiently utilized.

The new Rapid Bucket offering provides high-performance object storage over gRPC bidirectional streams rather than traditional REST APIs, which substantially raises throughput and lowers latency. With direct connectivity and zonal co-location, Rapid Storage delivers aggregate throughput of over 15 TiB/s and sub-millisecond latency for many operations.

Because the integration works through the existing fsspec interface, it slots into current code without extensive rewrites: developers can gain the performance improvements simply by switching to Rapid Buckets. In testing, this yielded a 23% performance gain over standard regional buckets, with notable improvements in both read and write throughput.
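As a minimal sketch of what this drop-in pattern looks like: fsspec resolves the URL scheme to a backend (gcsfs for `gs://` paths), so checkpointing code does not change when the target moves to a Rapid Bucket, only the path does. The bucket name below is hypothetical, and `pickle` stands in for `torch.save`/`torch.load`, which accept the same file-like objects; the demo uses fsspec's in-memory filesystem so it runs without cloud credentials.

```python
import pickle
import fsspec

def save_checkpoint(state, url):
    # fsspec.open dispatches on the URL scheme (gs://, memory://, file://, ...),
    # so the same code path works for GCS via gcsfs or any other backend.
    with fsspec.open(url, "wb") as f:
        pickle.dump(state, f)  # torch.save(state, f) takes the same file object

def load_checkpoint(url):
    with fsspec.open(url, "rb") as f:
        return pickle.load(f)  # torch.load(f) works the same way

# Demo against the in-memory filesystem; pointing the URL at a Rapid Bucket,
# e.g. "gs://my-rapid-bucket/ckpt.pt" (hypothetical name), is the only change.
state = {"step": 100, "loss": 0.25}
save_checkpoint(state, "memory://ckpt.pt")
restored = load_checkpoint("memory://ckpt.pt")
print(restored)  # → {'step': 100, 'loss': 0.25}
```

The design point the announcement relies on is exactly this indirection: because PyTorch data loaders and checkpoint utilities can read from any file-like object, swapping the storage tier is a configuration change rather than a code change.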