Optimum Neuron Distributed

The optimum.neuron.distributed module provides a set of tools to perform distributed training and inference.

Parallelization

Selecting Model-Specific Parallelizer Classes

Each model that supports parallelization in optimum-neuron has its own Parallelizer subclass. The ParallelizersManager factory class lets you easily retrieve the Parallelizer for a given model.

class optimum.neuron.distributed.ParallelizersManager

( )

get_supported_model_types

( )

Provides the list of supported model types for parallelization.
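
For example, a minimal sketch of listing the supported model types (the exact contents of the list depend on your optimum-neuron version):

from optimum.neuron.distributed import ParallelizersManager

# List of model type strings (e.g. "llama") that have a Parallelizer implementation.
supported = ParallelizersManager.get_supported_model_types()
print(supported)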

is_model_supported

( model_type_or_model: typing.Union[str, transformers.modeling_utils.PreTrainedModel, optimum.neuron.utils.peft_utils.NeuronPeftModel] )

Parameters

  • model_type_or_model (Union[str, PreTrainedModel, NeuronPeftModel]) — Either the model type or an instance of the model.

Returns a tuple of three booleans (see the sketch after this list) where:

  • The first element indicates if tensor parallelism can be used for this model,
  • The second element indicates if sequence parallelism can be used on top of tensor parallelism for this model,
  • The third element indicates if pipeline parallelism can be used for this model.
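
A minimal sketch of querying support, assuming the "llama" model type has a Parallelizer in your optimum-neuron version:

from optimum.neuron.distributed import ParallelizersManager

# One flag per parallelism strategy, in the order described above:
# (tensor parallelism, sequence parallelism, pipeline parallelism).
tp_supported, sp_supported, pp_supported = ParallelizersManager.is_model_supported("llama")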

parallelizer_for_model

( model_type_or_model: typing.Union[str, transformers.modeling_utils.PreTrainedModel, optimum.neuron.utils.peft_utils.NeuronPeftModel] )

Parameters

  • model_type_or_model (Union[str, PreTrainedModel, NeuronPeftModel]) — Either the model type or an instance of the model.

Returns the parallelizer class associated with the model.
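
A minimal sketch of retrieving the Parallelizer class, either from a model type string or from a model instance (the checkpoint name below is purely illustrative):

from transformers import AutoModelForCausalLM
from optimum.neuron.distributed import ParallelizersManager

# From the model type...
parallelizer_cls = ParallelizersManager.parallelizer_for_model("llama")

# ...or from an instantiated model.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
parallelizer_cls = ParallelizersManager.parallelizer_for_model(model)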

Utils

Lazy Loading

Distributed training / inference is usually needed when the model is too big to fit on a single device. Tools that allow lazy loading of model weights and optimizer states are thus needed to avoid running out of memory before parallelization.

optimum.neuron.distributed.lazy_load_for_parallelism

( tensor_parallel_size: int = 1 pipeline_parallel_size: int = 1 )

Parameters

  • tensor_parallel_size (int, defaults to 1) — The tensor parallel size considered.
  • pipeline_parallel_size (int, defaults to 1) — The pipeline parallel size considered.

Context manager that makes the loading of a model lazy for model parallelism:

  • Every torch.nn.Linear is put on the torch.device("meta") device, meaning that it takes no memory to instantiate.
  • Every torch.nn.Embedding is also put on the torch.device("meta") device.
  • No state dict is actually loaded, instead a weight map is created and attached to the model. For more information, read the optimum.neuron.distributed.utils.from_pretrained_for_mp docstring.

If both tensor_parallel_size and pipeline_parallel_size are set to 1, no lazy loading is performed.
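
A minimal sketch of loading a model lazily before parallelizing it, assuming a tensor parallel size of 8 (the checkpoint name is purely illustrative):

from transformers import AutoModelForCausalLM
from optimum.neuron.distributed import lazy_load_for_parallelism

# Inside the context manager, torch.nn.Linear and torch.nn.Embedding modules are created
# on the meta device, so the full model never materializes on a single device.
with lazy_load_for_parallelism(tensor_parallel_size=8):
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")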

optimum.neuron.distributed.make_optimizer_constructor_lazy

( optimizer_cls: typing.Type[torch.optim.Optimizer] )

Transforms an optimizer constructor (optimizer class) to make it lazy by not initializing the parameter state. This makes the resulting optimizer lightweight and usable to create the “real” optimizer once the model has been parallelized.
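
A minimal sketch of wrapping torch.optim.AdamW, assuming model was loaded under lazy_load_for_parallelism; how the lazy optimizer is then consumed by the training loop is not shown here:

import torch
from optimum.neuron.distributed import make_optimizer_constructor_lazy

# The wrapped constructor does not initialize optimizer state for the parameters,
# so it stays lightweight until the "real" optimizer is built after parallelization.
LazyAdamW = make_optimizer_constructor_lazy(torch.optim.AdamW)
optimizer = LazyAdamW(model.parameters(), lr=1e-4)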