# Finetrainers Parallel Backends

Finetrainers supports parallel training on multiple GPUs and nodes. This is done using the PyTorch DTensor backend, and training is launched with `torchrun`.

As an experiment for comparing the performance of different training backends, Finetrainers has implemented multi-backend support. These backends may or may not fully rely on PyTorch's distributed DTensor solution. Currently, only 🤗 Accelerate is supported, for backwards-compatibility reasons (Finetrainers initially started with only Accelerate). There are plans to integrate additional backends in the near future.

The multi-backend support is completely experimental and only serves to satisfy my curiosity about how much of a tradeoff there is between performance and ease of use. The PyTorch DTensor backend is the only one with stable support, followed by Accelerate.

Users will not have to worry about backwards-incompatible changes or extra dependencies if they stick to the PyTorch DTensor backend.

## Support matrix

There are various algorithms for parallel training. Currently, we only support:

## Training

The following parameters are relevant for launching training (an example invocation follows the list):

- `parallel_backend`: The backend to use for parallel training. Available options are `ptd` and `accelerate`.
- `pp_degree`: The degree of pipeline parallelism. Currently unsupported.
- `dp_degree`: The degree of data parallelism (number of replicas). Defaults to 1.
- `dp_shards`: The number of shards for data parallelism. Defaults to 1.
- `cp_degree`: The degree of context parallelism. Currently unsupported.
- `tp_degree`: The degree of tensor parallelism.
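
As a minimal sketch of how these options might be combined, the snippet below splits 8 GPUs into `dp_degree x dp_shards = 2 x 4` (data replicas x shards). The flag spellings simply mirror the parameter names listed above, and the chosen degrees are illustrative assumptions, not recommended values:

```sh
# Sketch: 8 GPUs on one node, 2 data-parallel replicas, each sharded across 4 GPUs.
# Flag names mirror the parameters listed above; other arguments are placeholders.
torchrun --standalone --nnodes=1 --nproc_per_node=8 train.py \
  --parallel_backend ptd \
  --pp_degree 1 \
  --dp_degree 2 \
  --dp_shards 4 \
  --cp_degree 1 \
  --tp_degree 1 \
  <YOUR_OTHER_ARGS>
```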

For launching training with the PyTorch DTensor backend, use the following:

```sh
# Single node - 8 GPUs available
torchrun --standalone --nnodes=1 --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint="localhost:0" train.py <YOUR_OTHER_ARGS>

# Single node - 8 GPUs installed but only 4 used
export CUDA_VISIBLE_DEVICES=0,2,4,5
torchrun --standalone --nnodes=1 --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint="localhost:0" train.py <YOUR_OTHER_ARGS>

# Multi-node - Nx8 GPUs available
# TODO(aryan): Add slurm script
```
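
Until the Slurm script lands, a generic multi-node `torchrun` launch might look like the sketch below. The node count, node rank, and rendezvous address/port are placeholders you would set per node; this is not the upstream-provided recipe:

```sh
# Sketch: run this on every node, with NODE_RANK set to 0..NNODES-1.
# MASTER_ADDR/MASTER_PORT are placeholders for the rendezvous host.
torchrun \
  --nnodes=$NNODES --node_rank=$NODE_RANK --nproc_per_node=8 \
  --rdzv_backend c10d --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  train.py <YOUR_OTHER_ARGS>
```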

For launching training with the Accelerate backend, use the following:

```sh
# Single node - 8 GPUs available
accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids 0,1,2,3,4,5,6,7 train.py <YOUR_OTHER_ARGS>

# Single node - 8 GPUs installed but only 4 used
accelerate launch --config_file accelerate_configs/uncompiled_4.yaml --gpu_ids 0,2,4,5 train.py <YOUR_OTHER_ARGS>

# Multi-node - Nx8 GPUs available
# TODO(aryan): Add slurm script
```
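
In the meantime, a generic multi-node `accelerate launch` invocation could look roughly like this sketch. The machine count, machine rank, and main-process address/port are placeholders, the config file name is assumed from the single-node examples above, and this is not the upstream-provided recipe:

```sh
# Sketch: run on every machine, with MACHINE_RANK set to 0..NUM_MACHINES-1.
# MAIN_IP/MAIN_PORT point at the machine with rank 0.
accelerate launch \
  --config_file accelerate_configs/uncompiled_8.yaml \
  --num_machines $NUM_MACHINES --machine_rank $MACHINE_RANK \
  --main_process_ip $MAIN_IP --main_process_port $MAIN_PORT \
  --num_processes $((NUM_MACHINES * 8)) \
  train.py <YOUR_OTHER_ARGS>
```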