Supercharging LLM inference on Google TPUs: Achieving 3x speedups with diffusion-style speculative decoding
Blog post from Google Cloud
Researchers at UCSD, led by Hao Zhang, have brought a novel speculative decoding method called DFlash to Google TPUs. DFlash uses block diffusion to generate an entire block of candidate tokens in a single forward pass, sidestepping the traditional bottleneck of predicting tokens one at a time.

By integrating DFlash into the vLLM TPU inference ecosystem, the UCSD team measured an average speedup of 3.13x in tokens per second on TPU v5p, with peak speedups approaching 6x on complex math tasks.

These results show how diffusion-style drafting can exploit the parallel computing capabilities of TPUs, and they set the stage for future speculative decoding systems and broader applications in AI hardware acceleration.
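To make the draft-and-verify loop concrete, here is a minimal sketch of block-wise speculative decoding with greedy verification. The `draft_block` and `target_next_token` functions below are toy stand-ins, not DFlash or vLLM APIs: in DFlash, the drafter would be a block-diffusion model emitting all candidate tokens in a single forward pass, and the target model would score the entire block in parallel on the TPU.

```python
# Minimal sketch of block-wise speculative decoding with greedy
# verification. The models here are illustrative stand-ins, not the
# DFlash or vLLM implementations.

GAMMA = 4  # assumed number of draft tokens proposed per step (block size)


def draft_block(context: list[int]) -> list[int]:
    """Toy drafter: proposes GAMMA candidate tokens at once.
    (In DFlash this would be one block-diffusion forward pass.)"""
    return [(context[-1] + i + 1) % 100 for i in range(GAMMA)]


def target_next_token(context: list[int]) -> int:
    """Toy target model: greedy next-token prediction.
    (In practice the target scores all GAMMA positions in one pass.)"""
    return (context[-1] + 1) % 100


def speculative_decode(prompt: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        block = draft_block(tokens)
        # Verify the block left to right: keep drafted tokens while they
        # match the target's greedy choice, then take the target's own
        # token at the first mismatch, so every step emits >= 1 token.
        accepted = []
        ctx = list(tokens)
        for cand in block:
            expected = target_next_token(ctx)
            if cand == expected:
                accepted.append(cand)
                ctx.append(cand)
            else:
                accepted.append(expected)
                break
        else:
            # Whole block accepted: take a bonus token from the verifier.
            accepted.append(target_next_token(ctx))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]


print(speculative_decode([7], 12))
```

In a real system, the verifier checks all drafted positions in one batched forward pass rather than the token-by-token loop above; that single-pass verification is what lets one TPU step validate an entire block and is where the parallel hardware pays off.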