TorchTPU: Running PyTorch Natively on TPUs at Google Scale
Blog post from Google Cloud
TorchTPU is an integration that enables PyTorch to run natively and efficiently on Google's Tensor Processing Units (TPUs), the hardware behind much of the large-scale distributed infrastructure that modern AI demands.

Developed with a focus on usability, portability, and performance, TorchTPU lets developers migrate existing PyTorch workloads with minimal code changes while getting the most out of TPU compute (a sketch of such a migration appears at the end of this post).

The system offers three eager execution modes, Debug, Strict, and Fused Eager, to balance flexibility against performance; Fused Eager delivers the largest speedups by fusing operations automatically (see the second sketch below).

TorchTPU also integrates with PyTorch's torch.compile interface for full-graph compilation, using the XLA backend to optimize dense computation and communication (see the final sketch below). This path supports a range of distributed training setups and lifts an earlier limitation by accommodating divergent execution (MPMD) alongside standard SPMD optimizations.

TorchTPU's architecture encourages model designs tailored to TPU hardware characteristics. Looking ahead, the team aims to reduce recompilation overhead, expand custom kernel capabilities, and support dynamic shapes directly through PyTorch's interfaces. As Google works toward a comprehensive PyTorch experience on TPUs, it plans to address these open challenges and grow the system's capabilities through 2026 and beyond, improving the efficiency and scalability of AI models.
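To ground the "minimal code changes" claim, here is a minimal sketch of what moving an ordinary PyTorch training step onto a TPU could look like. The `"tpu"` device string is an assumption made for illustration; the post does not specify TorchTPU's exact device API.

```python
import torch
import torch.nn as nn

# Assumed device spelling, for illustration only: the post promises
# migration with minimal code changes, so we model the TPU as an
# ordinary torch device. TorchTPU's actual device API may differ.
device = torch.device("tpu")

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One standard eager-mode training step; nothing TPU-specific beyond
# the device placement above.
x = torch.randn(8, 1024, device=device)
loss = model(x).square().mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```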
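The post names the three eager modes but does not show how one is selected. The context manager below is purely hypothetical, included only to illustrate the trade-off the modes describe; the `torchtpu` package name and `eager_mode` API are invented for this sketch.

```python
import torch
import torch.nn as nn
import torchtpu  # assumed package name; the API below is hypothetical

model = nn.Linear(1024, 1024).to(torch.device("tpu"))  # assumed device string
x = torch.randn(8, 1024, device="tpu")

# Mode names come from the post; it does not detail what Debug and
# Strict check, so no semantics are claimed here.
with torchtpu.eager_mode("debug"):
    y = model(x)

with torchtpu.eager_mode("strict"):
    y = model(x)

# Fused Eager: operations are fused automatically, which the post
# credits with significant performance improvements.
with torchtpu.eager_mode("fused"):
    y = model(x)
```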
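Finally, the torch.compile path is the piece closest to an existing public surface: today's torch_xla registers an "openxla" backend with torch.compile, and this sketch assumes TorchTPU keeps a similar spelling. The import path and backend name are taken from current torch_xla, not from the post.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # today's torch_xla; TorchTPU's import path may differ

device = xm.xla_device()
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).to(device)

# "openxla" is the backend name torch_xla registers with torch.compile;
# whether TorchTPU reuses it is an assumption. fullgraph=True requests a
# single whole-model capture, matching the post's description of
# full-graph compilation via XLA.
compiled = torch.compile(model, backend="openxla", fullgraph=True)

x = torch.randn(8, 1024, device=device)
out = compiled(x)  # first call traces and compiles; same-shape calls reuse the XLA executable
```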