Supercharging LLM inference on Google TPUs: Achieving 3x speedups with diffusion-style speculative decoding
Blog post from Google Cloud
Researchers at UCSD, led by Hao Zhang, have brought a novel speculative decoding method called DFlash to Google TPUs. DFlash uses block diffusion to generate an entire block of candidate tokens in a single forward pass, sidestepping the traditional bottleneck of predicting tokens one at a time.

By integrating DFlash into the vLLM TPU inference ecosystem, the UCSD team measured an average speedup of 3.13x in tokens per second on TPU v5p, with peak speedups approaching 6x on complex math tasks.

These results show how diffusion-style drafting can exploit the parallel computing capabilities of TPUs, and they set the stage for future speculative decoding systems and broader applications in AI hardware acceleration.
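To make the draft-and-verify loop concrete, here is a minimal sketch of block-wise speculative decoding with greedy verification. The `draft_block` and `target_next_token` functions below are toy stand-ins, not DFlash or vLLM APIs: in DFlash, the drafter would be a block-diffusion model emitting all candidate tokens in a single forward pass, and the target model would score the entire block in parallel on the TPU.

```python
# Minimal sketch of block-wise speculative decoding with greedy
# verification. The models here are illustrative stand-ins, not the
# DFlash or vLLM implementations.

GAMMA = 4  # assumed number of draft tokens proposed per step (block size)


def draft_block(context: list[int]) -> list[int]:
    """Toy drafter: proposes GAMMA candidate tokens at once.
    (In DFlash this would be one block-diffusion forward pass.)"""
    return [(context[-1] + i + 1) % 100 for i in range(GAMMA)]


def target_next_token(context: list[int]) -> int:
    """Toy target model: greedy next-token prediction.
    (In practice the target scores all GAMMA positions in one pass.)"""
    return (context[-1] + 1) % 100


def speculative_decode(prompt: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        block = draft_block(tokens)
        # Verify the block left to right: keep drafted tokens while they
        # match the target's greedy choice, then take the target's own
        # token at the first mismatch, so every step emits >= 1 token.
        accepted = []
        ctx = list(tokens)
        for cand in block:
            expected = target_next_token(ctx)
            if cand == expected:
                accepted.append(cand)
                ctx.append(cand)
            else:
                accepted.append(expected)
                break
        else:
            # Whole block accepted: take a bonus token from the verifier.
            accepted.append(target_next_token(ctx))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]


print(speculative_decode([7], 12))
```

In a real system, the verifier checks all drafted positions in one batched forward pass rather than the token-by-token loop above; that single-pass verification is what lets one TPU step validate an entire block and is where the parallel hardware pays off.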