How to Use Multiple GPUs in Hugging Face Transformers: Device Map vs Tensor Parallelism
Blog post from HuggingFace
To leverage multiple GPUs with Hugging Face transformers, the post discusses two primary methods: device_map and Tensor Parallelism.

The device_map approach targets models too large to fit on a single GPU: it splits the model's layers across devices so that each GPU holds only a shard of the weights. It is primarily a memory-saving technique for inference, and it offers no true parallel speed-up, because activations flow through the layer shards sequentially, so only one GPU is computing at any given moment (see the first sketch below).

Tensor Parallelism, in contrast, enables real multi-GPU computation by splitting individual tensor operations, such as the large matrix multiplications inside attention and MLP blocks, across GPUs, so every device works on every layer simultaneously. This yields faster inference and better scaling, but it requires a more complex distributed setup launched with a tool like torchrun (second sketch below).

In either case, setting the CUDA_VISIBLE_DEVICES environment variable controls which GPUs the process can see, ensuring that only the specified devices are used during model execution (third sketch below).
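A minimal sketch of the device_map path, assuming transformers and accelerate are installed; the model id is just an illustrative placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate assign each layer to a device, filling
# the visible GPUs in order (and spilling to CPU/disk only if it must).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Shows which device each submodule landed on.
print(model.hf_device_map)

inputs = tokenizer("Hello, multi-GPU world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```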
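Tensor Parallelism needs one process per GPU. Recent transformers releases accept a tp_plan="auto" argument in from_pretrained that shards the weight matrices of supported architectures across ranks; this sketch assumes such a release (plus a PyTorch version new enough for DTensor), and the model id is again a placeholder:

```python
# tp_example.py -- launch with: torchrun --nproc_per_node=4 tp_example.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # placeholder; must be a TP-supported architecture

# Under torchrun each process owns one GPU; tp_plan="auto" splits the big
# attention/MLP weight matrices across all ranks, so every GPU participates
# in every layer's matmuls.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    tp_plan="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism splits the matmuls", return_tensors="pt").input_ids.to(model.device)
outputs = model(inputs)

# All ranks compute the same logits jointly; print from rank 0 only.
if int(os.environ.get("RANK", "0")) == 0:
    print(outputs.logits.shape)  # (batch, seq_len, vocab_size)
```

Launched this way, the four processes cooperate on every forward pass instead of each waiting its turn, which is where the speed-up over device_map comes from.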
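Finally, GPU visibility can be restricted from the shell (CUDA_VISIBLE_DEVICES=0,1 python script.py) or from inside the script, as long as the variable is set before PyTorch initializes CUDA:

```python
import os

# Expose only physical GPUs 0 and 1 to this process. This must happen before
# the first CUDA call, which is why it is set before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

# The visible GPUs are renumbered: physical 0 and 1 become cuda:0 and cuda:1.
print(torch.cuda.device_count())  # prints 2 on a machine with two or more GPUs
```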