Unlocking asynchronicity in continuous batching
Blog post from Hugging Face
In this article, the authors show how asynchronous batching raises GPU utilization and inference performance by letting CPU and GPU work run concurrently. With traditional synchronous batching, the CPU and GPU take turns, so each waits while the other works, and the resulting GPU idle periods account for nearly a quarter of total runtime. Asynchronous batching uses CUDA streams and events so that the CPU prepares the next batch while the GPU processes the current one, which minimizes idle time. Different GPU operations run on separate streams, and events synchronize them to prevent data corruption and to guarantee that data is ready when it is needed. The method raises GPU active time from 76% to 99.4%, a 22% speedup in total generation time, without requiring new kernels or model changes. The implementation is part of the transformers library, and future articles will explore additional optimizations for further performance gains.
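To make the scheduling argument concrete, here is a minimal CPU-only sketch that models the two timelines arithmetically. It is a toy model, not the transformers implementation, and it calls no CUDA APIs. The function name `simulate`, the per-step costs (24 ms of CPU preparation, 76 ms of GPU compute), and the two-buffer assumption are all hypothetical, chosen only so that the synchronous case lands on a 76% GPU active fraction.

```python
def simulate(cpu_ms, gpu_ms, steps, overlap):
    """Toy timeline model (hypothetical, not from the post).

    Returns (total_ms, gpu_active_fraction) for `steps` batches, each needing
    `cpu_ms` of CPU preparation followed by `gpu_ms` of GPU compute.
    """
    cpu_done = 0.0       # when the CPU finished preparing the latest batch
    gpu_done = 0.0       # when the GPU finished the latest batch
    prev_gpu_done = 0.0  # when the GPU finished the batch before that
    for _ in range(steps):
        if overlap:
            # Assumed double buffering: the CPU may run one batch ahead, but
            # must wait until the GPU has released the buffer it will reuse.
            # In real code this wait would be an event synchronization.
            cpu_start = max(cpu_done, prev_gpu_done)
        else:
            # Synchronous: the CPU waits for the GPU to finish every batch.
            cpu_start = max(cpu_done, gpu_done)
        cpu_done = cpu_start + cpu_ms
        # The GPU starts once the inputs are ready and it is free.
        gpu_start = max(cpu_done, gpu_done)
        prev_gpu_done = gpu_done
        gpu_done = gpu_start + gpu_ms
    total = gpu_done
    return total, steps * gpu_ms / total


if __name__ == "__main__":
    for overlap in (False, True):
        total, active = simulate(24, 76, 100, overlap)
        label = "asynchronous" if overlap else "synchronous"
        print(f"{label}: total={total:.0f} ms, GPU active={active:.1%}")
```

With these illustrative inputs, the synchronous schedule keeps the GPU active 76% of the time, while the overlapped schedule hides every CPU preparation step except the first and reaches about 99.7%. These figures come from the toy model only; the measured numbers reported in the post depend on real kernel and transfer timings.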