Efficient LLM Pretraining: Packed Sequences and Masked Attention
Blog post from HuggingFace
Packed sequences and masked attention make pretraining of large language models (LLMs) more efficient by reducing wasted computation. Packing concatenates several shorter text sequences into a single longer sequence instead of padding each one to a fixed length. Because fewer slots in a batch are spent on padding tokens, less GPU memory is wasted and more real tokens are processed per batch, which shortens training time. Packing alone, however, would let tokens attend across sequence boundaries, so the attention mask must be constructed so that each token attends only to tokens from its own original sequence. Position IDs should likewise restart at the beginning of each sequence, so that the packed data is not treated as one continuous sequence. The technique may be incompatible with certain attention implementations, but it remains a practical way to improve the efficiency of LLM training. The post discusses it with reference to specific implementations and community feedback on its application.
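The three ingredients described above (packing, per-sequence position IDs, and a boundary-aware attention mask) can be illustrated with a minimal pure-Python sketch. It is not the blog post's own code: the function names `pack_sequences` and `build_attention_mask`, the toy token IDs, and the choice of a causal mask are all assumptions made for illustration.

```python
from typing import List, Tuple


def pack_sequences(seqs: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Concatenate sequences into one packed sequence.

    Returns the packed token IDs, per-token position IDs, and per-token
    sequence IDs (which original sequence each token came from).
    Hypothetical helper, written for illustration only.
    """
    input_ids: List[int] = []
    position_ids: List[int] = []
    seq_ids: List[int] = []
    for seq_index, seq in enumerate(seqs):
        input_ids.extend(seq)
        # Position IDs restart at 0 for every original sequence.
        position_ids.extend(range(len(seq)))
        seq_ids.extend([seq_index] * len(seq))
    return input_ids, position_ids, seq_ids


def build_attention_mask(seq_ids: List[int]) -> List[List[bool]]:
    """Build a block-diagonal causal mask.

    mask[i][j] is True when token i may attend to token j: both tokens
    come from the same original sequence and j is not after i.
    """
    n = len(seq_ids)
    return [
        [seq_ids[i] == seq_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]


# Toy token IDs, invented for this example.
sequences = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

input_ids, position_ids, seq_ids = pack_sequences(sequences)
mask = build_attention_mask(seq_ids)

# Slots used if every sequence were padded to the longest one,
# compared with the packed length.
padded_slots = len(sequences) * max(len(s) for s in sequences)
packed_slots = len(input_ids)

print(position_ids)                # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(padded_slots, packed_slots)  # 12 9
print(mask[3][2], mask[4][3])      # False True
```

Here the first token of the second sequence (index 3) cannot attend to the last token of the first sequence (index 2), while the second token of the second sequence (index 4) can attend to index 3. In a real training setup the mask and position IDs would be tensors passed to the model, and whether a given attention implementation accepts an arbitrary mask of this shape is exactly the compatibility concern noted above.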