TorchTPU: Running PyTorch Natively on TPUs at Google Scale
Blog post from Google Cloud
TorchTPU is an integration that enables PyTorch to run natively and efficiently on Google's Tensor Processing Units (TPUs), the hardware behind much of the large-scale distributed infrastructure that modern AI demands.

Developed with a focus on usability, portability, and performance, TorchTPU lets developers migrate existing PyTorch workloads with minimal code changes while getting the most out of TPU compute (a sketch of such a migration appears at the end of this post).

The system offers three eager execution modes, Debug, Strict, and Fused Eager, to balance flexibility against performance; Fused Eager delivers the largest speedups by fusing operations automatically (see the second sketch below).

TorchTPU also integrates with PyTorch's torch.compile interface for full-graph compilation, using the XLA backend to optimize dense computation and communication (see the final sketch below). This path supports a range of distributed training setups and lifts an earlier limitation by accommodating divergent execution (MPMD) alongside standard SPMD optimizations.

TorchTPU's architecture encourages model designs tailored to TPU hardware characteristics. Looking ahead, the team aims to reduce recompilation overhead, expand custom kernel capabilities, and support dynamic shapes directly through PyTorch's interfaces. As Google works toward a comprehensive PyTorch experience on TPUs, it plans to address these open challenges and grow the system's capabilities through 2026 and beyond, improving the efficiency and scalability of AI models.
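To ground the "minimal code changes" claim, here is a minimal sketch of what moving an ordinary PyTorch training step onto a TPU could look like. The `"tpu"` device string is an assumption made for illustration; the post does not specify TorchTPU's exact device API.

```python
import torch
import torch.nn as nn

# Assumed device spelling, for illustration only: the post promises
# migration with minimal code changes, so we model the TPU as an
# ordinary torch device. TorchTPU's actual device API may differ.
device = torch.device("tpu")

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One standard eager-mode training step; nothing TPU-specific beyond
# the device placement above.
x = torch.randn(8, 1024, device=device)
loss = model(x).square().mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```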
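The post names the three eager modes but does not show how one is selected. The context manager below is purely hypothetical, included only to illustrate the trade-off the modes describe; the `torchtpu` package name and `eager_mode` API are invented for this sketch.

```python
import torch
import torch.nn as nn
import torchtpu  # assumed package name; the API below is hypothetical

model = nn.Linear(1024, 1024).to(torch.device("tpu"))  # assumed device string
x = torch.randn(8, 1024, device="tpu")

# Mode names come from the post; it does not detail what Debug and
# Strict check, so no semantics are claimed here.
with torchtpu.eager_mode("debug"):
    y = model(x)

with torchtpu.eager_mode("strict"):
    y = model(x)

# Fused Eager: operations are fused automatically, which the post
# credits with significant performance improvements.
with torchtpu.eager_mode("fused"):
    y = model(x)
```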
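Finally, the torch.compile path is the piece closest to an existing public surface: today's torch_xla registers an "openxla" backend with torch.compile, and this sketch assumes TorchTPU keeps a similar spelling. The import path and backend name are taken from current torch_xla, not from the post.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # today's torch_xla; TorchTPU's import path may differ

device = xm.xla_device()
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).to(device)

# "openxla" is the backend name torch_xla registers with torch.compile;
# whether TorchTPU reuses it is an assumption. fullgraph=True requests a
# single whole-model capture, matching the post's description of
# full-graph compilation via XLA.
compiled = torch.compile(model, backend="openxla", fullgraph=True)

x = torch.randn(8, 1024, device=device)
out = compiled(x)  # first call traces and compiles; same-shape calls reuse the XLA executable
```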