Transformers documentation

Distributed GPU inference


Tensor parallelism shards a model across multiple GPUs and parallelizes computations such as matrix multiplication. It makes it possible to fit larger models into memory and speeds up inference because each GPU only processes a slice of each tensor.
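As a rough illustration of the idea (a toy sketch in plain PyTorch, not the actual Transformers implementation), a linear layer's weight can be split column-wise so that each GPU multiplies the input by its own slice, and concatenating the partial results reproduces the full matrix multiplication:

import torch

# Toy illustration of column-parallel sharding of a linear layer's weight
# across 2 "devices". Each shard computes part of the output, and
# concatenating the partial outputs matches the full matmul.
x = torch.randn(1, 8)                  # input activation
w = torch.randn(8, 16)                 # full weight matrix
w_shards = torch.chunk(w, 2, dim=1)    # one column slice per device

partial_outputs = [x @ shard for shard in w_shards]   # each device computes its slice
y = torch.cat(partial_outputs, dim=-1)                # gather the slices

assert torch.allclose(y, x @ w, atol=1e-6)

In practice the slices live on different devices and gathering the partial outputs is a collective communication step, but the arithmetic is the same.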

Expand the list below to see which models support tensor parallelism. Open a GitHub issue or pull request to request support for a model that isn't listed.

Supported models

Set tp_plan="auto" in from_pretrained() to enable tensor parallelism for inference.

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


# enable tensor parallelism
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    tp_plan="auto",
)

# prepare input tokens
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# distributed run
outputs = model(inputs)
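
The forward pass above returns raw logits. If you want generated text instead, the same tensor-parallel model can be used with generate() as usual (a brief sketch continuing the script above; the generation settings are illustrative):

# continue from the script above: generate and decode text
generated = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))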

Launch the inference script above with torchrun, using 4 processes per node (one process per GPU).

torchrun --nproc-per-node 4 demo.py
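
torchrun passes each spawned process its coordinates through environment variables such as RANK, WORLD_SIZE, and LOCAL_RANK, which the distributed backend relies on. As a quick sanity check (not part of the official example), you can print them at the top of demo.py:

import os

# torchrun sets these for every process it spawns
print(f"rank {os.environ.get('RANK')} / world size {os.environ.get('WORLD_SIZE')} (local rank {os.environ.get('LOCAL_RANK')})")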

For CPU, bind a different socket to each rank. For example, if you are using a 4th Gen Intel Xeon:

export OMP_NUM_THREADS=56
numactl -C 0-55 -m 0 torchrun --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 demo.py &
numactl -C 56-111 -m 1 torchrun --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 --nproc-per-node 1 demo.py &
wait

The CPU benchmark data will be released soon.

You can benefit from considerable inference speedups, especially for inputs with a large batch size or long sequences.

For a single forward pass on Llama with a sequence length of 512 and various batch sizes, you can expect the following speedups.
