
How to Use Multiple GPUs in Hugging Face Transformers: Device Map vs Tensor Parallelism

Blog post from HuggingFace

Post Details
Author
Aritra Roy Gosthipaty
Word Count
606
Summary

To run a Hugging Face Transformers model on multiple GPUs, the post compares two approaches: device_map and tensor parallelism.

The device_map approach suits models too large for a single GPU. It splits the model's layers across the available devices, primarily for inference: this saves memory, but because the layers still execute one after another there is no true parallel speed-up.

Tensor parallelism, by contrast, performs real multi-GPU computation. Large tensor operations such as matrix multiplications are sharded across GPUs, yielding faster inference and better scaling, at the cost of a more complex distributed setup launched with tools like torchrun.

Finally, setting the CUDA_VISIBLE_DEVICES environment variable controls which GPUs are visible to the process, ensuring only the intended devices are used during model execution.
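A minimal sketch of the device_map approach, assuming a multi-GPU machine and access to the (illustrative) model checkpoint named below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works

# device_map="auto" asks Accelerate to place the model's layers across all
# visible GPUs (spilling to CPU/disk only if needed), so a model too large
# for one GPU still loads. Layers run sequentially, so this saves memory
# rather than speeding up inference.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inspect where each module landed:
print(model.hf_device_map)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

You can also pass an explicit dictionary (e.g. `{"model.embed_tokens": 0, "model.layers.0": 0, ...}`) instead of "auto" to pin specific modules to specific devices.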
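Tensor parallelism requires one process per GPU, launched with torchrun. A sketch under the assumption of a recent transformers release that supports the tp_plan argument, with an illustrative model id and a 4-GPU node:

```python
# run_tp.py -- launch with: torchrun --nproc-per-node=4 run_tp.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative

# tp_plan="auto" shards each large weight matrix across the processes that
# torchrun started, so the matrix multiplications themselves run in parallel
# -- a genuine speed-up, unlike layer-wise device_map placement.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Every rank executes the same script; torchrun sets the environment variables (RANK, WORLD_SIZE, etc.) that the sharding logic reads.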
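CUDA_VISIBLE_DEVICES can be set either on the command line (`CUDA_VISIBLE_DEVICES=0,1 python script.py`) or from Python, as long as it happens before CUDA is initialized. A small sketch:

```python
import os

# Restrict this process to physical GPUs 0 and 1. This must be set before
# torch (or any CUDA-using library) initializes the GPU context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Inside the process, the visible GPUs are renumbered starting from 0,
# so "cuda:0" now means physical GPU 0 and "cuda:1" means physical GPU 1.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(visible)  # ['0', '1']
```

Because device_map="auto" and tp_plan="auto" both operate on whatever GPUs are visible, this variable is the simplest way to keep a job off GPUs reserved for other work.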