Content Deep Dive

Speeding Up AI: Bringing Google Colossus to PyTorch via GCSFS and Rapid Bucket

Blog post from Google Cloud

Post Details
Company
Date Published
Author
Trinadh Kotturu and Martin Durant
Word Count
703
Language
English
Summary

Google Cloud has announced an integration of Rapid Storage, which is built on Google's Colossus storage architecture, with the PyTorch ecosystem through the fsspec interface. The integration targets data loading and checkpointing, which become bottlenecks as model sizes grow and can leave GPUs underutilized. Rapid Bucket provides high-performance object storage over gRPC bidirectional streams rather than traditional REST APIs, which raises throughput and lowers latency. With direct connectivity and zonal co-location, Rapid Storage reaches an aggregate throughput of over 15 TiB/s and latency under 1 ms for various operations. Existing code does not need extensive rewrites: developers get the improvement by switching to Rapid Buckets. In testing, this produced a 23% performance gain over standard regional buckets, with improvements in both read and write throughput.
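The "switch buckets without rewriting code" point can be illustrated with a minimal sketch. It assumes a Rapid Bucket is addressed through the same `gs://` URL scheme as a regional bucket when `gcsfs` is installed, and that any extra setup is handled elsewhere; the summary does not confirm either detail. The bucket names, the `iter_chunks` helper, and the `opener` parameter are hypothetical and exist only for illustration.

```python
import io
from typing import BinaryIO, Callable, ContextManager, Iterator

# Hypothetical object URLs: only the bucket name differs between the two.
REGIONAL_URL = "gs://example-regional-bucket/train/shard-00000.bin"
RAPID_URL = "gs://example-rapid-bucket/train/shard-00000.bin"


def fsspec_opener(url: str) -> ContextManager[BinaryIO]:
    """Open a URL for binary reading through fsspec.

    gs:// URLs require the gcsfs package. The import is lazy so this
    sketch can be loaded on a machine without fsspec installed.
    """
    import fsspec

    return fsspec.open(url, "rb")


def iter_chunks(
    url: str,
    chunk_size: int = 8 * 1024 * 1024,
    opener: Callable[[str], ContextManager[BinaryIO]] = fsspec_opener,
) -> Iterator[bytes]:
    """Yield the object at `url` as consecutive chunks of at most chunk_size bytes."""
    with opener(url) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    # Offline demonstration with an in-memory stand-in for the remote object.
    fake = lambda url: io.BytesIO(b"0123456789")
    print([len(c) for c in iter_chunks(RAPID_URL, chunk_size=4, opener=fake)])
```

Under these assumptions, a data loader built on `iter_chunks` would move from a regional bucket to a Rapid Bucket by passing `RAPID_URL` instead of `REGIONAL_URL`; the reading logic itself does not change.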