
Utility functions and classes

Below are a variety of utility functions that 🤗 Accelerate provides, broken down by use-case.

Constants

Constants used throughout 🤗 Accelerate for reference

The following are constants used when utilizing Accelerator.save_state()

  • utils.MODEL_NAME: "pytorch_model"
  • utils.OPTIMIZER_NAME: "optimizer"
  • utils.RNG_STATE_NAME: "random_states"
  • utils.SCALER_NAME: "scaler.pt"
  • utils.SCHEDULER_NAME: "scheduler"

The following are constants used when utilizing Accelerator.save_model()

  • utils.WEIGHTS_NAME: "pytorch_model.bin"
  • utils.SAFE_WEIGHTS_NAME: "model.safetensors"
  • utils.WEIGHTS_INDEX_NAME: "pytorch_model.bin.index.json"
  • utils.SAFE_WEIGHTS_INDEX_NAME: "model.safetensors.index.json"
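
For illustration, a minimal sketch of where these names show up on disk. The checkpoint directory name is arbitrary, and safe_serialization=False is assumed so the non-safetensors names apply:

import os
import torch
from accelerate import Accelerator
from accelerate.utils import MODEL_NAME, OPTIMIZER_NAME

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

accelerator = Accelerator()
model, optimizer = accelerator.prepare(model, optimizer)
accelerator.save_state("my_checkpoint", safe_serialization=False)

# The saved files are expected to be built from the constants above,
# e.g. f"{MODEL_NAME}.bin" ("pytorch_model.bin") and f"{OPTIMIZER_NAME}.bin" ("optimizer.bin").
print(sorted(os.listdir("my_checkpoint")))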

Data Classes

These are basic dataclasses used throughout 🤗 Accelerate and they can be passed in as parameters.

Standalone

These are standalone dataclasses used for checks, such as the type of distributed system being used

class accelerate.utils.ComputeEnvironment

( value, names = None, module = None, qualname = None, type = None, start = 1 )

Represents a type of the compute environment.

Values:

  • LOCAL_MACHINE — private/custom cluster hardware.
  • AMAZON_SAGEMAKER — Amazon SageMaker as compute environment.

class accelerate.DistributedType

( value, names = None, module = None, qualname = None, type = None, start = 1 )

Represents a type of distributed environment.

Values:

  • NO — Not a distributed environment, just a single process.
  • MULTI_CPU — Distributed on multiple CPU nodes.
  • MULTI_GPU — Distributed on multiple GPUs.
  • MULTI_MLU — Distributed on multiple MLUs.
  • MULTI_SDAA — Distributed on multiple SDAAs.
  • MULTI_MUSA — Distributed on multiple MUSAs.
  • MULTI_NPU — Distributed on multiple NPUs.
  • MULTI_XPU — Distributed on multiple XPUs.
  • MULTI_HPU — Distributed on multiple HPUs.
  • DEEPSPEED — Using DeepSpeed.
  • XLA — Using TorchXLA.

class accelerate.utils.DynamoBackend

( value, names = None, module = None, qualname = None, type = None, start = 1 )

Represents a dynamo backend (see https://pytorch.org/docs/stable/torch.compiler.html).

Values:

  • NO — Do not use torch dynamo.
  • EAGER — Uses PyTorch to run the extracted GraphModule. This is quite useful in debugging TorchDynamo issues.
  • AOT_EAGER — Uses AotAutograd with no compiler, i.e, just using PyTorch eager for the AotAutograd’s extracted forward and backward graphs. This is useful for debugging, and unlikely to give speedups.
  • INDUCTOR — Uses TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels. Read more
  • AOT_TS_NVFUSER — nvFuser with AotAutograd/TorchScript. Read more
  • NVPRIMS_NVFUSER — nvFuser with PrimTorch. Read more
  • CUDAGRAPHS — cudagraphs with AotAutograd. Read more
  • OFI — Uses Torchscript optimize_for_inference. Inference only. Read more
  • FX2TRT — Uses Nvidia TensorRT for inference optimizations. Inference only. Read more
  • ONNXRT — Uses ONNXRT for inference on CPU/GPU. Inference only. Read more
  • TENSORRT — Uses ONNXRT to run TensorRT for inference optimizations. Read more
  • AOT_TORCHXLA_TRACE_ONCE — Uses Pytorch/XLA with TorchDynamo optimization, for training. Read more
  • TORCHXLA_TRACE_ONCE — Uses Pytorch/XLA with TorchDynamo optimization, for inference. Read more
  • IPEX — Uses IPEX for inference on CPU. Inference only. Read more.
  • TVM — Uses Apache TVM for inference optimizations. Read more
  • HPU_BACKEND — Uses HPU backend for inference optimizations.

class accelerate.utils.LoggerType

( value, names = None, module = None, qualname = None, type = None, start = 1 )

Represents a type of supported experiment tracker

Values:

  • ALL — all available trackers in the environment that are supported
  • TENSORBOARD — TensorBoard as an experiment tracker
  • WANDB — wandb as an experiment tracker
  • COMETML — comet_ml as an experiment tracker
  • DVCLIVE — dvclive as an experiment tracker

class accelerate.utils.PrecisionType

( value, names = None, module = None, qualname = None, type = None, start = 1 )

Represents a type of precision used on floating point values

Values:

  • NO — using full precision (FP32)
  • FP16 — using half precision
  • BF16 — using brain floating point precision

class accelerate.utils.RNGType

( value, names = None, module = None, qualname = None, type = None, start = 1 )

An enumeration.

class accelerate.utils.SageMakerDistributedType

( value, names = None, module = None, qualname = None, type = None, start = 1 )

Represents a type of distributed environment.

Values:

  • NO — Not a distributed environment, just a single process.
  • DATA_PARALLEL — using sagemaker distributed data parallelism.
  • MODEL_PARALLEL — using sagemaker distributed model parallelism.

Kwargs

These are configurable arguments for specific interactions throughout the PyTorch ecosystem that Accelerate handles under the hood.

class accelerate.AutocastKwargs

( enabled: bool = True, cache_enabled: bool = None )

Use this object in your Accelerator to customize how torch.autocast behaves. Please refer to the documentation of this context manager for more information on each argument.

Example:

from accelerate import Accelerator
from accelerate.utils import AutocastKwargs

kwargs = AutocastKwargs(cache_enabled=True)
accelerator = Accelerator(kwargs_handlers=[kwargs])

class accelerate.DistributedDataParallelKwargs

( dim: int = 0, broadcast_buffers: bool = True, bucket_cap_mb: int = 25, find_unused_parameters: bool = False, check_reduction: bool = False, gradient_as_bucket_view: bool = False, static_graph: bool = False, comm_hook: DDPCommunicationHookType = <DDPCommunicationHookType.NO: 'no'>, comm_wrapper: typing.Literal[<DDPCommunicationHookType.NO: 'no'>, <DDPCommunicationHookType.FP16: 'fp16'>, <DDPCommunicationHookType.BF16: 'bf16'>] = <DDPCommunicationHookType.NO: 'no'>, comm_state_option: dict = <factory> )

Use this object in your Accelerator to customize how your model is wrapped in a torch.nn.parallel.DistributedDataParallel. Please refer to the documentation of this wrapper for more information on each argument.

gradient_as_bucket_view is only available in PyTorch 1.7.0 and later versions.

static_graph is only available in PyTorch 1.11.0 and later versions.

Example:

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[kwargs])

class accelerate.utils.FP8RecipeKwargs

( opt_level: typing.Literal['O1', 'O2'] = None, use_autocast_during_eval: bool = None, margin: int = None, interval: int = None, fp8_format: typing.Literal['E4M3', 'HYBRID'] = None, amax_history_len: int = None, amax_compute_algo: typing.Literal['max', 'most_recent'] = None, override_linear_precision: tuple = None, backend: typing.Literal['MSAMP', 'TE'] = None )

Deprecated. Please use one of the proper FP8 recipe kwargs classes such as TERecipeKwargs or MSAMPRecipeKwargs instead.
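
As a replacement, a minimal sketch that passes one of the newer recipe classes through kwargs_handlers. It assumes the TransformerEngine backend is installed and that TERecipeKwargs accepts an fp8_format argument:

from accelerate import Accelerator
from accelerate.utils import TERecipeKwargs

# Illustrative value; check the TERecipeKwargs fields for the supported options.
fp8_kwargs = TERecipeKwargs(fp8_format="HYBRID")
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_kwargs])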

class accelerate.GradScalerKwargs

( init_scale: float = 65536.0, growth_factor: float = 2.0, backoff_factor: float = 0.5, growth_interval: int = 2000, enabled: bool = True )

Use this object in your Accelerator to customize the behavior of mixed precision, specifically how the torch.cuda.amp.GradScaler used is created. Please refer to the documentation of this scaler for more information on each argument.

GradScaler is only available in PyTorch 1.5.0 and later versions.

Example:

from accelerate import Accelerator
from accelerate.utils import GradScalerKwargs

kwargs = GradScalerKwargs(backoff_factor=0.25)
accelerator = Accelerator(kwargs_handlers=[kwargs])

class accelerate.InitProcessGroupKwargs

( backend: typing.Optional[str] = 'nccl', init_method: typing.Optional[str] = None, timeout: typing.Optional[datetime.timedelta] = None )

Use this object in your Accelerator to customize the initialization of the distributed processes. Please refer to the documentation of this method for more information on each argument.

Note: If timeout is set to None, the default will be based upon how backend is set.

from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=800))
accelerator = Accelerator(kwargs_handlers=[kwargs])

class accelerate.utils.KwargsHandler

( )

Internal mixin that implements a to_kwargs() method for a dataclass.

to_kwargs

( )

Returns a dictionary containing the attributes with values different from the default of this class.
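
A small sketch of what to_kwargs() returns on one of the handlers above; only the fields that differ from their defaults are expected in the result:

from accelerate import DistributedDataParallelKwargs

kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
print(kwargs.to_kwargs())
# expected: {'find_unused_parameters': True}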

Plugins

These are plugins that can be passed to the Accelerator object. While they are defined elsewhere in the documentation, for convenience all of them are available to see here:

class accelerate.DeepSpeedPlugin

( hf_ds_config: typing.Any = None, gradient_accumulation_steps: int = None, gradient_clipping: float = None, zero_stage: int = None, is_train_batch_min: bool = True, offload_optimizer_device: str = None, offload_param_device: str = None, offload_optimizer_nvme_path: str = None, offload_param_nvme_path: str = None, zero3_init_flag: bool = None, zero3_save_16bit_model: bool = None, transformer_moe_cls_names: str = None, enable_msamp: bool = None, msamp_opt_level: typing.Optional[typing.Literal['O1', 'O2']] = None )

Parameters

  • hf_ds_config (Any, defaults to None) — Path to DeepSpeed config file or dict or an object of class accelerate.utils.deepspeed.HfDeepSpeedConfig.
  • gradient_accumulation_steps (int, defaults to None) — Number of steps to accumulate gradients before updating optimizer states. If not set, will use the value from the Accelerator directly.
  • gradient_clipping (float, defaults to None) — Enable gradient clipping with value.
  • zero_stage (int, defaults to None) — Possible options are 0, 1, 2, 3. Default will be taken from environment variable.
  • is_train_batch_min (bool, defaults to True) — If both train & eval dataloaders are specified, this will decide the train_batch_size.
  • offload_optimizer_device (str, defaults to None) — Possible options are none|cpu|nvme. Only applicable with ZeRO Stages 2 and 3.
  • offload_param_device (str, defaults to None) — Possible options are none|cpu|nvme. Only applicable with ZeRO Stage 3.
  • offload_optimizer_nvme_path (str, defaults to None) — Possible options are /nvme|/local_nvme. Only applicable with ZeRO Stage 3.
  • offload_param_nvme_path (str, defaults to None) — Possible options are /nvme|/local_nvme. Only applicable with ZeRO Stage 3.
  • zero3_init_flag (bool, defaults to None) — Flag to indicate whether to enable deepspeed.zero.Init for constructing massive models. Only applicable with ZeRO Stage-3.
  • zero3_save_16bit_model (bool, defaults to None) — Flag to indicate whether to save 16-bit model. Only applicable with ZeRO Stage-3.
  • transformer_moe_cls_names (str, defaults to None) — Comma-separated list of Transformers MoE layer class names (case-sensitive). For example, MixtralSparseMoeBlock, Qwen2MoeSparseMoeBlock, JetMoEAttention, JetMoEBlock, etc.
  • enable_msamp (bool, defaults to None) — Flag to indicate whether to enable MS-AMP backend for FP8 training.
  • msamp_opt_level (Optional[Literal["O1", "O2"]], defaults to None) — Optimization level for MS-AMP (defaults to ‘O1’). Only applicable if enable_msamp is True. Should be one of [‘O1’ or ‘O2’].

This plugin is used to integrate DeepSpeed.
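
A minimal sketch of constructing the plugin in code and handing it to the Accelerator; the values are illustrative, and most users configure this through accelerate config instead:

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)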

deepspeed_config_process

( prefix = '', mismatches = None, config = None, must_match = True, **kwargs )

Process the DeepSpeed config with the values from the kwargs.

select

( _from_accelerator_state: bool = False )

Sets the HfDeepSpeedWeakref to use the current deepspeed plugin configuration

class accelerate.FullyShardedDataParallelPlugin

( fsdp_version: int = None, sharding_strategy: typing.Union[str, ForwardRef('torch.distributed.fsdp.ShardingStrategy')] = None, reshard_after_forward: typing.Union[str, ForwardRef('torch.distributed.fsdp.ShardingStrategy'), bool] = None, backward_prefetch: typing.Union[str, ForwardRef('torch.distributed.fsdp.BackwardPrefetch'), NoneType] = None, mixed_precision_policy: typing.Union[dict, ForwardRef('torch.distributed.fsdp.MixedPrecision'), ForwardRef('torch.distributed.fsdp.MixedPrecisionPolicy'), NoneType] = None, auto_wrap_policy: typing.Union[typing.Callable, typing.Literal['transformer_based_wrap', 'size_based_wrap', 'no_wrap'], NoneType] = None, cpu_offload: typing.Union[bool, ForwardRef('torch.distributed.fsdp.CPUOffload'), ForwardRef('torch.distributed.fsdp.CPUOffloadPolicy')] = None, ignored_modules: typing.Optional[collections.abc.Iterable[torch.nn.modules.module.Module]] = None, state_dict_type: typing.Union[str, ForwardRef('torch.distributed.fsdp.StateDictType')] = None, state_dict_config: typing.Union[ForwardRef('torch.distributed.fsdp.FullStateDictConfig'), ForwardRef('torch.distributed.fsdp.ShardedStateDictConfig'), NoneType] = None, optim_state_dict_config: typing.Union[ForwardRef('torch.distributed.fsdp.FullOptimStateDictConfig'), ForwardRef('torch.distributed.fsdp.ShardedOptimStateDictConfig'), NoneType] = None, limit_all_gathers: bool = True, use_orig_params: typing.Optional[bool] = None, param_init_fn: typing.Optional[typing.Callable[[torch.nn.modules.module.Module], NoneType]] = None, sync_module_states: typing.Optional[bool] = None, forward_prefetch: bool = None, activation_checkpointing: bool = None, cpu_ram_efficient_loading: bool = None, transformer_cls_names_to_wrap: typing.Optional[list[str]] = None, min_num_params: typing.Optional[int] = None )

Parameters

  • fsdp_version (int, defaults to 1) — The version of FSDP to use. Defaults to 1. If set to 2, launcher expects the config to be converted to FSDP2 format.
  • sharding_strategy (Union[str, torch.distributed.fsdp.ShardingStrategy], defaults to 'FULL_SHARD') — Sharding strategy to use. Should be either a str or an instance of torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy. Is deprecated in favor of reshard_after_forward.
  • reshard_after_forward (Union[str, torch.distributed.fsdp.ShardingStrategy, bool], defaults to 'FULL_SHARD' for fsdp_version=1 and True for fsdp_version=2) — Sharding strategy to use. Should be a bool if fsdp_version is set to 2 else a str or an instance of torch.distributed.fsdp.fully_sharded_data_parallel.ShardingStrategy.
  • backward_prefetch (Union[str, torch.distributed.fsdp.BackwardPrefetch], defaults to 'NO_PREFETCH') — Backward prefetch strategy to use. Should be either a str or an instance of torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch.
  • mixed_precision_policy (Optional[Union[dict, torch.distributed.fsdp.MixedPrecision, torch.distributed.fsdp.MixedPrecisionPolicy]], defaults to None) — A config to enable mixed precision training with FullyShardedDataParallel. If passing in a dict, it should have the following keys: param_dtype, reduce_dtype, and buffer_dtype, can be an instance of torch.distributed.fsdp.MixedPrecisionPolicy if fsdp_version is set to 2.
  • auto_wrap_policy (Optional[Union[Callable, Literal["transformer_based_wrap", "size_based_wrap", "no_wrap"]]], defaults to NO_WRAP) — A callable or string specifying a policy to recursively wrap layers with FSDP. If a string, it must be one of transformer_based_wrap, size_based_wrap, or no_wrap. See torch.distributed.fsdp.wrap.size_based_auto_wrap_policy for a direction on what it should look like.
  • cpu_offload (Union[bool, torch.distributed.fsdp.CPUOffload, torch.distributed.fsdp.CPUOffloadPolicy], defaults to False) — Whether to offload parameters to CPU. Should be either a bool or an instance of torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffload or torch.distributed.fsdp.fully_sharded_data_parallel.CPUOffloadPolicy if fsdp_version is set to 2.
  • ignored_modules (Optional[Iterable[torch.nn.Module]], defaults to None) — A list of modules to ignore when wrapping with FSDP.
  • state_dict_type (Union[str, torch.distributed.fsdp.StateDictType], defaults to 'FULL_STATE_DICT') — State dict type to use. If a string, it must be one of full_state_dict, local_state_dict, or sharded_state_dict.
  • state_dict_config (Optional[Union[torch.distributed.fsdp.FullStateDictConfig, torch.distributed.fsdp.ShardedStateDictConfig]], defaults to None) — State dict config to use. Is determined based on the state_dict_type if not passed in.
  • optim_state_dict_config (Optional[Union[torch.distributed.fsdp.FullOptimStateDictConfig, torch.distributed.fsdp.ShardedOptimStateDictConfig]], defaults to None) — Optim state dict config to use. Is determined based on the state_dict_type if not passed in.
  • limit_all_gathers (bool, defaults to True) — Whether to have FSDP explicitly synchronize the CPU thread to prevent too many in-flight all-gathers. This bool only affects the sharded strategies that schedule all-gathers. Enabling this can help lower the number of CUDA malloc retries.
  • use_orig_params (bool, defaults to False) — Whether to use the original parameters for the optimizer.
  • param_init_fn (Optional[Callable[[torch.nn.Module], None]], defaults to None) — A Callable[torch.nn.Module] -> None that specifies how modules that are currently on the meta device should be initialized onto an actual device. Only applicable when sync_module_states is True. By default is a lambda which calls to_empty on the module.
  • sync_module_states (bool, defaults to False) — Whether each individually wrapped FSDP unit should broadcast module parameters from rank 0 to ensure they are the same across all ranks after initialization. Defaults to False unless cpu_ram_efficient_loading is True, then will be forcibly enabled.
  • forward_prefetch (bool, defaults to False) — Whether to have FSDP explicitly prefetch the next upcoming all-gather while executing in the forward pass. Only use with static graphs.
  • activation_checkpointing (bool, defaults to False) — A technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. Effectively, this trades extra computation time for reduced memory usage.
  • cpu_ram_efficient_loading (bool, defaults to None) — If True, only the first process loads the pretrained model checkpoint while all other processes have empty weights. Only applicable for Transformers. When using this, sync_module_states needs to be True.
  • transformer_cls_names_to_wrap (Optional[List[str]], defaults to None) — A list of transformer layer class names to wrap. Only applicable when auto_wrap_policy is transformer_based_wrap.
  • min_num_params (Optional[int], defaults to None) — The minimum number of parameters a module must have to be wrapped. Only applicable when auto_wrap_policy is size_based_wrap.

This plugin is used to enable fully sharded data parallelism.
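
A minimal sketch of passing the plugin to the Accelerator; the chosen strategy and wrap policy are illustrative:

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",
    auto_wrap_policy="transformer_based_wrap",
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)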

set_auto_wrap_policy

( model )

Given model, creates an auto_wrap_policy based on the passed-in policy and on whether transformer_cls_to_wrap can be used.

set_mixed_precision

( mixed_precision, buffer_autocast = False, override = False )

Sets the mixed precision policy for FSDP

set_state_dict_type

( state_dict_type = None )

Set the state dict config based on the StateDictType.

validate_mixed_precision_policy

( )

Validates the mixed precision policy, abstracted away to not bring in the imports if not needed.

class accelerate.utils.GradientAccumulationPlugin

( num_steps: int = None, adjust_scheduler: bool = True, sync_with_dataloader: bool = True, sync_each_batch: bool = False )

Parameters

  • num_steps (int) — The number of steps to accumulate gradients for.
  • adjust_scheduler (bool, optional, defaults to True) — Whether to adjust the scheduler steps to account for the number of steps being accumulated. Should be True if the used scheduler was not adjusted for gradient accumulation.
  • sync_with_dataloader (bool, optional, defaults to True) — Whether to synchronize setting the gradients when at the end of the dataloader.
  • sync_each_batch (bool, optional) — Whether to synchronize setting the gradients at each data batch. Setting to True may reduce memory requirements when using gradient accumulation with distributed training, at the expense of speed.

A plugin to configure gradient accumulation behavior. You can only pass one of gradient_accumulation_plugin or gradient_accumulation_steps to Accelerator. Passing both raises an error.

Example:

from accelerate.utils import GradientAccumulationPlugin

gradient_accumulation_plugin = GradientAccumulationPlugin(num_steps=2)
accelerator = Accelerator(gradient_accumulation_plugin=gradient_accumulation_plugin)

class accelerate.utils.MegatronLMPlugin

( tp_degree: int = None, pp_degree: int = None, num_micro_batches: int = None, gradient_clipping: float = None, sequence_parallelism: bool = None, recompute_activations: bool = None, use_distributed_optimizer: bool = None, pipeline_model_parallel_split_rank: int = None, num_layers_per_virtual_pipeline_stage: int = None, is_train_batch_min: str = True, train_iters: int = None, train_samples: int = None, weight_decay_incr_style: str = 'constant', start_weight_decay: float = None, end_weight_decay: float = None, lr_decay_style: str = 'linear', lr_decay_iters: int = None, lr_decay_samples: int = None, lr_warmup_iters: int = None, lr_warmup_samples: int = None, lr_warmup_fraction: float = None, min_lr: float = 0, consumed_samples: list = None, no_wd_decay_cond: typing.Optional[typing.Callable] = None, scale_lr_cond: typing.Optional[typing.Callable] = None, lr_mult: float = 1.0, megatron_dataset_flag: bool = False, seq_length: int = None, encoder_seq_length: int = None, decoder_seq_length: int = None, tensorboard_dir: str = None, set_all_logging_options: bool = False, eval_iters: int = 100, eval_interval: int = 1000, return_logits: bool = False, custom_train_step_class: typing.Optional[typing.Any] = None, custom_train_step_kwargs: typing.Optional[dict[str, typing.Any]] = None, custom_model_provider_function: typing.Optional[typing.Callable] = None, custom_prepare_model_function: typing.Optional[typing.Callable] = None, custom_megatron_datasets_provider_function: typing.Optional[typing.Callable] = None, custom_get_batch_function: typing.Optional[typing.Callable] = None, custom_loss_function: typing.Optional[typing.Callable] = None, other_megatron_args: typing.Optional[dict[str, typing.Any]] = None )

Parameters

  • tp_degree (int, defaults to None) — Tensor parallelism degree.
  • pp_degree (int, defaults to None) — Pipeline parallelism degree.
  • num_micro_batches (int, defaults to None) — Number of micro-batches.
  • gradient_clipping (float, defaults to None) — Gradient clipping value based on global L2 Norm (0 to disable).
  • sequence_parallelism (bool, defaults to None) — Enable sequence parallelism.
  • recompute_activations (bool, defaults to None) — Enable selective activation recomputation.
  • use_distributed_optimizer (bool, defaults to None) — Enable distributed optimizer.
  • pipeline_model_parallel_split_rank (int, defaults to None) — Rank where encoder and decoder should be split.
  • num_layers_per_virtual_pipeline_stage (int, defaults to None) — Number of layers per virtual pipeline stage.
  • is_train_batch_min (str, defaults to True) — If both train & eval dataloaders are specified, this will decide the micro_batch_size.
  • train_iters (int, defaults to None) — Total number of iterations to train over all training runs. Note that either train-iters or train-samples should be provided when using MegatronLMDummyScheduler.
  • train_samples (int, defaults to None) — Total number of samples to train over all training runs. Note that either train-iters or train-samples should be provided when using MegatronLMDummyScheduler.
  • weight_decay_incr_style (str, defaults to 'constant') — Weight decay increment function. choices=[“constant”, “linear”, “cosine”].
  • start_weight_decay (float, defaults to None) — Initial weight decay coefficient for L2 regularization.
  • end_weight_decay (float, defaults to None) — End of run weight decay coefficient for L2 regularization.
  • lr_decay_style (str, defaults to 'linear') — Learning rate decay function. choices=[‘constant’, ‘linear’, ‘cosine’].
  • lr_decay_iters (int, defaults to None) — Number of iterations for learning rate decay. If None defaults to train_iters.
  • lr_decay_samples (int, defaults to None) — Number of samples for learning rate decay. If None defaults to train_samples.
  • lr_warmup_iters (int, defaults to None) — Number of iterations to linearly warmup learning rate over.
  • lr_warmup_samples (int, defaults to None) — Number of samples to linearly warmup learning rate over.
  • lr_warmup_fraction (float, defaults to None) — Fraction of lr-warmup-(iters/samples) to linearly warmup learning rate over.
  • min_lr (float, defaults to 0) — Minimum value for learning rate. The scheduler clips values below this threshold.
  • consumed_samples (List, defaults to None) — Number of samples already consumed, given in the same order as the dataloaders passed to the accelerator.prepare call.
  • no_wd_decay_cond (Optional, defaults to None) — Condition to disable weight decay.
  • scale_lr_cond (Optional, defaults to None) — Condition to scale learning rate.
  • lr_mult (float, defaults to 1.0) — Learning rate multiplier.
  • megatron_dataset_flag (bool, defaults to False) — Whether the format of dataset follows Megatron-LM Indexed/Cached/MemoryMapped format.
  • seq_length (int, defaults to None) — Maximum sequence length to process.
  • encoder_seq_length (int, defaults to None) — Maximum sequence length to process for the encoder.
  • decoder_seq_length (int, defaults to None) — Maximum sequence length to process for the decoder.
  • tensorboard_dir (str, defaults to None) — Path to save tensorboard logs.
  • set_all_logging_options (bool, defaults to False) — Whether to set all logging options.
  • eval_iters (int, defaults to 100) — Number of iterations to run for evaluation validation/test for.
  • eval_interval (int, defaults to 1000) — Interval between running evaluation on validation set.
  • return_logits (bool, defaults to False) — Whether to return logits from the model.
  • custom_train_step_class (Optional, defaults to None) — Custom train step class.
  • custom_train_step_kwargs (Optional, defaults to None) — Custom train step kwargs.
  • custom_model_provider_function (Optional, defaults to None) — Custom model provider function.
  • custom_prepare_model_function (Optional, defaults to None) — Custom prepare model function.
  • custom_megatron_datasets_provider_function (Optional, defaults to None) — Custom megatron train_valid_test datasets provider function.
  • custom_get_batch_function (Optional, defaults to None) — Custom get batch function.
  • custom_loss_function (Optional, defaults to None) — Custom loss function.
  • other_megatron_args (Optional, defaults to None) — Other Megatron-LM arguments. Please refer Megatron-LM.

Plugin for Megatron-LM to enable tensor, pipeline, sequence and data parallelism. Also to enable selective activation recomputation and optimized fused kernels.
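
A minimal sketch of constructing the plugin; the parallelism degrees are illustrative, and it is assumed here that the Accelerator accepts it via megatron_lm_plugin (in practice this is usually driven by accelerate config):

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

megatron_lm_plugin = MegatronLMPlugin(tp_degree=2, pp_degree=2, num_micro_batches=2)
accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)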

class accelerate.utils.TorchDynamoPlugin

( backend: DynamoBackend = None, mode: str = None, fullgraph: bool = None, dynamic: bool = None, options: typing.Any = None, disable: bool = False )

Parameters

  • backend (DynamoBackend, defaults to None) — A valid Dynamo backend. See https://pytorch.org/docs/stable/torch.compiler.html for more details.
  • mode (str, defaults to None) — Possible options are ‘default’, ‘reduce-overhead’ or ‘max-autotune’.
  • fullgraph (bool, defaults to None) — Whether it is ok to break model into several subgraphs.
  • dynamic (bool, defaults to None) — Whether to use dynamic shape for tracing.
  • options (Any, defaults to None) — A dictionary of options to pass to the backend.
  • disable (bool, defaults to False) — Turn torch.compile() into a no-op for testing

This plugin is used to compile a model with PyTorch 2.0
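
A minimal sketch, assuming string backends are converted internally and that the Accelerator accepts the plugin via dynamo_plugin:

from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin

dynamo_plugin = TorchDynamoPlugin(backend="inductor", mode="max-autotune")
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)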

Configurations

These are classes which can be configured and passed through to the appropriate integration

class accelerate.utils.BnbQuantizationConfig

( load_in_8bit: bool = False, llm_int8_threshold: float = 6.0, load_in_4bit: bool = False, bnb_4bit_quant_type: str = 'fp4', bnb_4bit_use_double_quant: bool = False, bnb_4bit_compute_dtype: str = 'fp16', torch_dtype: dtype = None, skip_modules: list = None, keep_in_fp32_modules: list = None )

Parameters

  • load_in_8bit (bool, defaults to False) — Enable 8bit quantization.
  • llm_int8_threshold (float, defaults to 6.0) — Value of the outlier threshold. Only relevant when load_in_8bit=True.
  • load_in_4bit (bool, defaults to False) — Enable 4bit quantization.
  • bnb_4bit_quant_type (str, defaults to fp4) — Set the quantization data type in the bnb.nn.Linear4Bit layers. Options are {‘fp4’, ‘nf4’}.
  • bnb_4bit_use_double_quant (bool, defaults to False) — Enable nested quantization where the quantization constants from the first quantization are quantized again.
  • bnb_4bit_compute_dtype (str, defaults to fp16) — This sets the computational type, which might be different than the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups. Options are {‘fp32’, ‘fp16’, ‘bf16’}.
  • torch_dtype (torch.dtype, defaults to None) — This sets the dtype of the remaining non-quantized layers. The bitsandbytes library suggests setting the value to torch.float16 for 8-bit models and using the same dtype as the compute dtype for 4-bit models.
  • skip_modules (List[str], defaults to None) — An explicit list of the modules that we don’t quantize. The dtype of these modules will be torch_dtype.
  • keep_in_fp32_modules (List, defaults to None) — An explicit list of the modules that we don’t quantize. We keep them in torch.float32.

A plugin to enable BitsAndBytes 4bit and 8bit quantization
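
A minimal sketch with load_and_quantize_model; the toy model is illustrative, and bitsandbytes plus a CUDA device are assumed to be available:

import torch
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 64))
bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6.0)
quantized_model = load_and_quantize_model(model, bnb_quantization_config=bnb_quantization_config)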

class accelerate.DataLoaderConfiguration

( split_batches: bool = False, dispatch_batches: bool = None, even_batches: bool = True, use_seedable_sampler: bool = False, data_seed: int = None, non_blocking: bool = False, use_stateful_dataloader: bool = False )

Parameters

  • split_batches (bool, defaults to False) — Whether or not the accelerator should split the batches yielded by the dataloaders across the devices. If True, the actual batch size used will be the same on any kind of distributed processes, but it must be a round multiple of num_processes you are using. If False, actual batch size used will be the one set in your script multiplied by the number of processes.
  • dispatch_batches (bool, defaults to None) — If set to True, the dataloader prepared by the Accelerator is only iterated through on the main process and then the batches are split and broadcast to each process. Will default to True for DataLoader whose underlying dataset is an IterableDataset, False otherwise.
  • even_batches (bool, defaults to True) — If set to True, in cases where the total batch size across all processes does not exactly divide the dataset, samples at the start of the dataset will be duplicated so the batch can be divided equally among all workers.
  • use_seedable_sampler (bool, defaults to False) — Whether or not to use a fully seedable random sampler (data_loader.SeedableRandomSampler). Ensures training results are fully reproducible using a different sampling technique. While seed-to-seed results may differ, on average the differences are negligible when using multiple different seeds to compare. Should also be run with set_seed() for the best results.
  • data_seed (int, defaults to None) — The seed to use for the underlying generator when using use_seedable_sampler. If None, the generator will use the current default seed from torch.
  • non_blocking (bool, defaults to False) — If set to True, the dataloader prepared by the Accelerator will utilize non-blocking host-to-device transfers, allowing for better overlap between dataloader communication and computation. Recommended that the prepared dataloader has pin_memory set to True to work properly.
  • use_stateful_dataloader (bool, defaults to False) — If set to True, the dataloader prepared by the Accelerator will be backed by torchdata.StatefulDataLoader. This requires torchdata version 0.8.0 or higher that supports StatefulDataLoader to be installed.

Configuration for dataloader-related items when calling accelerator.prepare.
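
A minimal sketch of passing the configuration to the Accelerator:

from accelerate import Accelerator, DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(split_batches=True, use_seedable_sampler=True)
accelerator = Accelerator(dataloader_config=dataloader_config)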

class accelerate.utils.ProjectConfiguration

( project_dir: str = None, logging_dir: str = None, automatic_checkpoint_naming: bool = False, total_limit: int = None, iteration: int = 0, save_on_each_node: bool = False )

Parameters

  • project_dir (str, defaults to None) — A path to a directory for storing data.
  • logging_dir (str, defaults to None) — A path to a directory for storing logs of locally-compatible loggers. If None, defaults to project_dir.
  • automatic_checkpoint_naming (bool, defaults to False) — Whether saved states should be automatically iteratively named.
  • total_limit (int, defaults to None) — The maximum number of total saved states to keep.
  • iteration (int, defaults to 0) — The current save iteration.
  • save_on_each_node (bool, defaults to False) — When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on the main one.

Configuration for the Accelerator object based on inner-project needs.

set_directories

( project_dir: str = None )

Sets self.project_dir and self.logging_dir to the appropriate values.
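
A minimal sketch; the directory name is arbitrary, and with automatic_checkpoint_naming the states are expected to land under a checkpoints/ subfolder of project_dir:

from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

project_config = ProjectConfiguration(project_dir="my_project", automatic_checkpoint_naming=True, total_limit=5)
accelerator = Accelerator(project_config=project_config)
accelerator.save_state()  # e.g. my_project/checkpoints/checkpoint_0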

Environmental Variables

These are environmental variables that can be enabled for different use cases

  • ACCELERATE_DEBUG_MODE (str): Whether to run accelerate in debug mode. More info available here.

Data Manipulation and Operations

These include data operations that mimic the same torch ops but can be used on distributed processes.

accelerate.utils.broadcast

( tensor, from_process: int = 0 )

Parameters

  • tensor (nested list/tuple/dictionary of torch.Tensor) — The data to broadcast.
  • from_process (int, optional, defaults to 0) — The process from which to send the data.

Recursively broadcast tensor in a nested list/tuple/dictionary of tensors to all devices.

accelerate.utils.broadcast_object_list

( object_list, from_process: int = 0 )

Parameters

  • object_list (list of picklable objects) — The list of objects to broadcast. This list will be modified inplace.
  • from_process (int, optional, defaults to 0) — The process from which to send the data.

Broadcast a list of picklable objects from one process to the others.

accelerate.utils.concatenate

( data, dim = 0 )

Parameters

  • data (nested list/tuple/dictionary of lists of torch.Tensor) — The data to concatenate.
  • dim (int, optional, defaults to 0) — The dimension on which to concatenate.

Recursively concatenate the tensors in a nested list/tuple/dictionary of lists of tensors with the same shape.

accelerate.utils.convert_outputs_to_fp32

( model_forward )

accelerate.utils.convert_to_fp32

( tensor )

Parameters

  • tensor (nested list/tuple/dictionary of torch.Tensor) — The data to convert from FP16/BF16 to FP32.

Recursively converts the elements nested list/tuple/dictionary of tensors in FP16/BF16 precision to FP32.

accelerate.utils.gather

( tensor )

Parameters

  • tensor (nested list/tuple/dictionary of torch.Tensor) — The data to gather.

Recursively gather tensor in a nested list/tuple/dictionary of tensors from all devices.
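
A minimal sketch meant to be run with accelerate launch on more than one process; each process contributes a tensor containing its own index:

import torch
from accelerate import Accelerator
from accelerate.utils import gather

accelerator = Accelerator()
local_tensor = torch.tensor([accelerator.process_index], device=accelerator.device)
gathered = gather(local_tensor)
print(gathered)  # e.g. tensor([0, 1]) on every process when launched with 2 processes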

accelerate.utils.gather_object

( object: typing.Any )

Parameters

  • object (nested list/tuple/dictionary of picklable object) — The data to gather.

Recursively gather object in a nested list/tuple/dictionary of objects from all devices.

accelerate.utils.get_grad_scaler

( distributed_type: DistributedType = None, **kwargs )

Parameters

  • distributed_type (DistributedType, optional, defaults to None) — The type of distributed environment.
  • kwargs — Additional arguments for the utilized GradScaler constructor.

A generic helper which will initialize the correct GradScaler implementation based on the environment and return it.

accelerate.utils.get_mixed_precision_context_manager

( native_amp: bool = False, autocast_kwargs: AutocastKwargs = None )

Parameters

  • native_amp (bool, optional, defaults to False) — Whether mixed precision is actually enabled.
  • autocast_kwargs (AutocastKwargs, optional, defaults to None) — Arguments to customize how torch.autocast behaves (for instance, whether the weight cache inside autocast is enabled).

Return a context manager for autocasting mixed precision

accelerate.utils.listify

( data )

Parameters

  • data (nested list/tuple/dictionary of torch.Tensor) — The data from which to convert to regular numbers.

Recursively finds tensors in a nested list/tuple/dictionary and converts them to a list of numbers.

accelerate.utils.pad_across_processes

( tensor, dim = 0, pad_index = 0, pad_first = False )

Parameters

  • tensor (nested list/tuple/dictionary of torch.Tensor) — The data to gather.
  • dim (int, optional, defaults to 0) — The dimension on which to pad.
  • pad_index (int, optional, defaults to 0) — The value with which to pad.
  • pad_first (bool, optional, defaults to False) — Whether to pad at the beginning or the end.

Recursively pad the tensors in a nested list/tuple/dictionary of tensors from all devices to the same size so they can safely be gathered.
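
A minimal sketch meant to be run with accelerate launch; each process holds a tensor of a different length, which is padded before gathering:

import torch
from accelerate import Accelerator
from accelerate.utils import gather, pad_across_processes

accelerator = Accelerator()
# process 0 holds 1 element, process 1 holds 2 elements, and so on
tensor = torch.ones(accelerator.process_index + 1, device=accelerator.device)
padded = pad_across_processes(tensor, dim=0, pad_index=0)
gathered = gather(padded)  # all padded tensors now share the same shape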

accelerate.utils.recursively_apply

( func, data, *args, test_type = is_torch_tensor, error_on_other_type = False, **kwargs )

Parameters

  • func (callable) — The function to recursively apply.
  • data (nested list/tuple/dictionary of main_type) — The data on which to apply func
  • *args — Positional arguments that will be passed to func when applied on the unpacked data.
  • main_type (type, optional, defaults to torch.Tensor) — The base type of the objects to which apply func.
  • error_on_other_type (bool, optional, defaults to False) — Whether to return an error or not if after unpacking data, we encounter an object that is not of type main_type. If False, the function will leave objects of types different than main_type unchanged.
  • **kwargs (additional keyword arguments, optional) — Keyword arguments that will be passed to func when applied on the unpacked data.

Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.

accelerate.utils.reduce

( tensor, reduction = 'mean', scale = 1.0 )

Parameters

  • tensor (nested list/tuple/dictionary of torch.Tensor) — The data to reduce.
  • reduction (str, optional, defaults to "mean") — A reduction method. Can be one of “mean”, “sum”, or “none”.
  • scale (float, optional) — A default scaling value to be applied after the reduce, only valid on XLA.

Recursively reduce the tensors in a nested list/tuple/dictionary of lists of tensors across all processes with the given operation.

accelerate.utils.send_to_device

( tensor, device, non_blocking = False, skip_keys = None )

Parameters

  • tensor (nested list/tuple/dictionary of torch.Tensor) — The data to send to a given device.
  • device (torch.device) — The device to send the data to.

Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.

accelerate.utils.slice_tensors

( data, tensor_slice, process_index = None, num_processes = None )

Parameters

  • data (nested list/tuple/dictionary of torch.Tensor) — The data to slice.
  • tensor_slice (slice) — The slice to take.

Recursively takes a slice in a nested list/tuple/dictionary of tensors.

Environment Checks

These functionalities check the state of the current working environment including information about the operating system itself, what it can support, and if particular dependencies are installed.

accelerate.utils.is_bf16_available

( ignore_tpu = False )

Checks if bf16 is supported, optionally ignoring the TPU

accelerate.utils.is_ipex_available

( )

Checks if ipex is installed.

accelerate.utils.is_mps_available

( min_version = '1.12' )

Checks if MPS device is available. The minimum version required is 1.12.

accelerate.utils.is_npu_available

( check_device = False )

Checks if torch_npu is installed and potentially if an NPU is in the environment

accelerate.utils.is_torch_version

( operation: str, version: str )

Parameters

  • operation (str) — A string representation of an operator, such as ">" or "<="
  • version (str) — A string version of PyTorch

Compares the current PyTorch version to a given reference with an operation.
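
A small sketch of gating a code path on the installed PyTorch version:

from accelerate.utils import is_torch_version

if is_torch_version(">=", "2.4.0"):
    print("torch.load(weights_only=...) can be forwarded through accelerate.utils.load")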

accelerate.utils.is_torch_xla_available

( check_is_tpu = False, check_is_gpu = False )

Check if torch_xla is available. To train a native PyTorch job in an environment with torch xla installed, set the USE_TORCH_XLA environment variable to false.

accelerate.utils.is_xpu_available

( check_device = False )

Checks if XPU acceleration is available either via intel_extension_for_pytorch or via stock PyTorch (>=2.4) and potentially if an XPU is in the environment

Environment Manipulation

accelerate.utils.patch_environment

( **kwargs )

A context manager that will add each keyword argument passed to os.environ and remove them when exiting.

Will convert the values in kwargs to strings and upper-case all the keys.

Example:

>>> import os
>>> from accelerate.utils import patch_environment

>>> with patch_environment(FOO="bar"):
...     print(os.environ["FOO"])  # prints "bar"
>>> print(os.environ["FOO"])  # raises KeyError

accelerate.utils.clear_environment

( )

A context manager that will temporarily clear environment variables.

When this context exits, the previous environment variables will be back.

Example:

>>> import os
>>> from accelerate.utils import clear_environment

>>> os.environ["FOO"] = "bar"
>>> with clear_environment():
...     print(os.environ)
...     os.environ["FOO"] = "new_bar"
...     print(os.environ["FOO"])
{}
new_bar

>>> print(os.environ["FOO"])
bar

accelerate.commands.config.default.write_basic_config

( mixed_precision = 'no', save_location: str = default_json_config_file )

Parameters

  • mixed_precision (str, optional, defaults to “no”) — Mixed Precision to use. Should be one of “no”, “fp16”, or “bf16”
  • save_location (str, optional, defaults to default_json_config_file) — Optional custom save location. Should be passed to --config_file when using accelerate launch. Default location is inside the huggingface cache folder (~/.cache/huggingface) but can be overridden by setting the HF_HOME environmental variable, followed by accelerate/default_config.yaml.

Creates and saves a basic cluster config to be used on a local machine with potentially multiple GPUs. Will also set CPU if it is a CPU-only machine.

When setting up 🤗 Accelerate for the first time, rather than running accelerate config, write_basic_config() can be used as an alternative for quick configuration.

accelerate.utils.set_numa_affinity

( local_process_index: int, verbose: typing.Optional[bool] = None )

Parameters

  • local_process_index (int) — The index of the current process on the current server.
  • verbose (bool, optional) — Whether to print the new cpu cores assignment for each process. If ACCELERATE_DEBUG_MODE is enabled, will default to True.

Assigns the current process to a specific NUMA node. Ideally, this is most efficient when there are at least 2 CPUs per node.

This result is cached between calls. If you want to override it, please use accelerate.utils.environment.override_numa_affinity.

accelerate.utils.environment.override_numa_affinity

( local_process_index: int, verbose: typing.Optional[bool] = None )

Parameters

  • local_process_index (int) — The index of the current process on the current server.
  • verbose (bool, optional) — Whether to log out the assignment of each CPU. If ACCELERATE_DEBUG_MODE is enabled, will default to True.

Overrides whatever NUMA affinity is set for the current process. This is very taxing and requires recalculating the affinity to set; ideally you should use utils.environment.set_numa_affinity instead.

accelerate.utils.purge_accelerate_environment

( func_or_cls )

Decorator to clean up accelerate environment variables set by the decorated class or function.

In some circumstances, calling certain classes or functions can result in accelerate env vars being set and not being cleaned up afterwards. As an example, when calling:

TrainingArguments(fp16=True, …)

The following env var will be set:

ACCELERATE_MIXED_PRECISION=fp16

This can affect subsequent code, since the env var takes precedence over TrainingArguments(fp16=False). This is especially relevant for unit testing, where we want to avoid the individual tests to have side effects on one another. Decorate the unit test function or whole class with this decorator to ensure that after each test, the env vars are cleaned up. This works for both unittest.TestCase and normal classes (pytest); it also works when decorating the parent class.

Memory

accelerate.find_executable_batch_size

( function: callable = None, starting_batch_size: int = 128 )

Parameters

  • function (callable, optional) — A function to wrap
  • starting_batch_size (int, optional) — The batch size to try and fit into memory

A basic decorator that will try to execute function. If it fails from exceptions related to out-of-memory or CUDNN, the batch size is cut in half and passed to function

function must take in a batch_size parameter as its first argument.

Example:

>>> from accelerate.utils import find_executable_batch_size


>>> @find_executable_batch_size(starting_batch_size=128)
... def train(batch_size, model, optimizer):
...     ...


>>> train(model, optimizer)

Modeling

These utilities relate to interacting with PyTorch models

accelerate.utils.calculate_maximum_sizes

( model: Module )

Computes the total size of the model and its largest layer

accelerate.utils.compute_module_sizes

( model: Module, dtype: typing.Union[torch.device, str, NoneType] = None, special_dtypes: typing.Optional[dict[str, typing.Union[str, torch.device]]] = None, buffers_only: bool = False )

Compute the size of each submodule of a given model.

accelerate.utils.extract_model_from_parallel

( model, keep_fp32_wrapper: bool = True, keep_torch_compile: bool = True, recursive: bool = False ) → torch.nn.Module

Parameters

  • model (torch.nn.Module) — The model to extract.
  • keep_fp32_wrapper (bool, optional) — Whether to remove mixed precision hooks from the model.
  • keep_torch_compile (bool, optional) — Whether to unwrap compiled model.
  • recursive (bool, optional, defaults to False) — Whether to recursively extract all cases of module.module from model as well as unwrap child sublayers recursively, not just the top-level distributed containers.

Returns

torch.nn.Module

The extracted model.

Extract a model from its distributed containers.

accelerate.utils.get_balanced_memory

( model: Module, max_memory: typing.Optional[dict[typing.Union[int, str], typing.Union[int, str]]] = None, no_split_module_classes: typing.Optional[list[str]] = None, dtype: typing.Union[str, torch.dtype, NoneType] = None, special_dtypes: typing.Optional[dict[str, typing.Union[str, torch.device]]] = None, low_zero: bool = False )

Parameters

  • model (torch.nn.Module) — The model to analyze.
  • max_memory (Dict, optional) — A dictionary device identifier to maximum memory. Will default to the maximum memory available if unset. Example: max_memory={0: "1GB"}.
  • no_split_module_classes (List[str], optional) — A list of layer class names that should never be split across device (for instance any layer that has a residual connection).
  • dtype (str or torch.dtype, optional) — If provided, the weights will be converted to that type when loaded.
  • special_dtypes (Dict[str, Union[str, torch.device]], optional) — If provided, special dtypes to consider for some specific weights (will override dtype used as default for all weights).
  • low_zero (bool, optional) — Minimizes the number of weights on GPU 0, which is convenient when it’s used for other operations (like the Transformers generate function).

Compute a max_memory dictionary for infer_auto_device_map() that will balance the use of each available GPU.

All computation is done analyzing sizes and dtypes of the model parameters. As a result, the model can be on the meta device (as it would if initialized within the init_empty_weights context manager).

accelerate.utils.get_max_layer_size

( modules: list, module_sizes: dict, no_split_module_classes: list ) → Tuple[int, List[str]]

Parameters

  • modules (List[Tuple[str, torch.nn.Module]]) — The list of named modules where we want to determine the maximum layer size.
  • module_sizes (Dict[str, int]) — A dictionary mapping each layer name to its size (as generated by compute_module_sizes).
  • no_split_module_classes (List[str]) — A list of class names for layers we don’t want to be split.

Returns

Tuple[int, List[str]]

The maximum size of a layer with the list of layer names realizing that maximum size.

Utility function that will scan a list of named modules and return the maximum size used by one full layer. The definition of a layer being:

  • a module with no direct children (just parameters and buffers)
  • a module whose class name is in the list no_split_module_classes

accelerate.infer_auto_device_map

( model: Module, max_memory: typing.Optional[dict[typing.Union[int, str], typing.Union[int, str]]] = None, no_split_module_classes: typing.Optional[list[str]] = None, dtype: typing.Union[str, torch.dtype, NoneType] = None, special_dtypes: typing.Optional[dict[str, typing.Union[str, torch.dtype]]] = None, verbose: bool = False, clean_result: bool = True, offload_buffers: bool = False, fallback_allocation: bool = False )

Parameters

  • model (torch.nn.Module) — The model to analyze.
  • max_memory (Dict, optional) — A dictionary device identifier to maximum memory. Will default to the maximum memory available if unset. Example: max_memory={0: "1GB"}.
  • no_split_module_classes (List[str], optional) — A list of layer class names that should never be split across device (for instance any layer that has a residual connection).
  • dtype (str or torch.dtype, optional) — If provided, the weights will be converted to that type when loaded.
  • special_dtypes (Dict[str, Union[str, torch.device]], optional) — If provided, special dtypes to consider for some specific weights (will override dtype used as default for all weights).
  • verbose (bool, optional, defaults to False) — Whether or not to provide debugging statements as the function builds the device_map.
  • clean_result (bool, optional, defaults to True) — Clean the resulting device_map by grouping all submodules that go on the same device together.
  • offload_buffers (bool, optional, defaults to False) — In the layers that are offloaded on the CPU or the hard drive, whether or not to offload the buffers as well as the parameters.
  • fallback_allocation (bool, optional, defaults to False) — When regular allocation fails, try to allocate a module that fits in the size limit using BFS.

Compute a device map for a given model giving priority to GPUs, then offload on CPU and finally offload to disk, such that:

  • we don’t exceed the memory available on any of the GPUs.
  • if offloading to the CPU is needed, there is always room left on GPU 0 to put back the layer offloaded on CPU that has the largest size.
  • if offloading to the CPU is needed, we don’t exceed the RAM available on the CPU.
  • if offloading to the disk is needed, there is always room left on the CPU to put back the layer offloaded on disk that has the largest size.

All computation is done analyzing sizes and dtypes of the model parameters. As a result, the model can be on the meta device (as it would if initialized within the init_empty_weights context manager).
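
A minimal sketch using a meta-device model; the toy architecture and memory limits are illustrative:

import torch
from accelerate import infer_auto_device_map, init_empty_weights

with init_empty_weights():
    model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(20)])

device_map = infer_auto_device_map(model, max_memory={0: "20MB", "cpu": "200MB"})
print(device_map)  # maps submodule names to 0, "cpu", or "disk"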

accelerate.load_checkpoint_in_model

( model: Module, checkpoint: typing.Union[str, os.PathLike], device_map: typing.Optional[dict[str, typing.Union[int, str, torch.device]]] = None, offload_folder: typing.Union[str, os.PathLike, NoneType] = None, dtype: typing.Union[str, torch.dtype, NoneType] = None, offload_state_dict: bool = False, offload_buffers: bool = False, keep_in_fp32_modules: list = None, offload_8bit_bnb: bool = False, strict: bool = False )

Parameters

  • model (torch.nn.Module) — The model in which we want to load a checkpoint.
  • checkpoint (str or os.PathLike) — The folder checkpoint to load. It can be:
    • a path to a file containing a whole model state dict
    • a path to a .json file containing the index to a sharded checkpoint
    • a path to a folder containing a unique .index.json file and the shards of a checkpoint.
    • a path to a folder containing a unique pytorch_model.bin or a model.safetensors file.
  • device_map (Dict[str, Union[int, str, torch.device]], optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device.
  • offload_folder (str or os.PathLike, optional) — If the device_map contains any value "disk", the folder where we will offload weights.
  • dtype (str or torch.dtype, optional) — If provided, the weights will be converted to that type when loaded.
  • offload_state_dict (bool, optional, defaults to False) — If True, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU RAM if the weight of the CPU state dict + the biggest shard does not fit.
  • offload_buffers (bool, optional, defaults to False) — Whether or not to include the buffers in the weights offloaded to disk.
  • keep_in_fp32_modules (List[str], optional) — A list of the modules that we keep in torch.float32 dtype.
  • offload_8bit_bnb (bool, optional) — Whether or not to enable offload of 8-bit modules on cpu/disk.
  • strict (bool, optional, defaults to False) — Whether to strictly enforce that the keys in the checkpoint state_dict match the keys of the model’s state_dict.

Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are loaded.

Once loaded across devices, you still need to call dispatch_model() on your model to make it able to run. To group the checkpoint loading and dispatch in one single call, use load_checkpoint_and_dispatch().

accelerate.utils.load_offloaded_weights

( model, index, offload_folder )

Parameters

  • model (torch.nn.Module) — The model to load the weights into.
  • index (dict) — A dictionary containing the parameter name and its metadata for each parameter that was offloaded from the model.
  • offload_folder (str) — The folder where the offloaded weights are stored.

Loads the weights from the offload folder into the model.

accelerate.utils.load_state_dict

( checkpoint_file, device_map = None )

Parameters

  • checkpoint_file (str) — The path to the checkpoint to load.
  • device_map (Dict[str, Union[int, str, torch.device]], optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device.

Load a checkpoint from a given file. If the checkpoint is in the safetensors format and a device map is passed, the weights can be fast-loaded directly on the GPU.

accelerate.utils.offload_state_dict

( save_dir: typing.Union[str, os.PathLike], state_dict: dict )

Parameters

  • save_dir (str or os.PathLike) — The directory in which to offload the state dict.
  • state_dict (Dict[str, torch.Tensor]) — The dictionary of tensors to offload.

Offload a state dict in a given folder.

accelerate.utils.retie_parameters

( model, tied_params )

Parameters

  • model (torch.nn.Module) — The model in which to retie parameters.
  • tied_params (List[List[str]]) — A list of groups of tied parameter names, as obtained by find_tied_parameters.

Reties tied parameters in a given model if the link was broken (for instance when adding hooks).

accelerate.utils.set_module_tensor_to_device

( module: Module, tensor_name: str, device: typing.Union[int, str, torch.device], value: typing.Optional[torch.Tensor] = None, dtype: typing.Union[str, torch.dtype, NoneType] = None, fp16_statistics: typing.Optional[torch.HalfTensor] = None, tied_params_map: typing.Optional[dict[int, dict[torch.device, torch.Tensor]]] = None )

Parameters

  • module (torch.nn.Module) — The module in which the tensor we want to move lives.
  • tensor_name (str) — The full name of the parameter/buffer.
  • device (int, str or torch.device) — The device on which to set the tensor.
  • value (torch.Tensor, optional) — The value of the tensor (useful when going from the meta device to any other device).
  • dtype (torch.dtype, optional) — If passed along, the value of the parameter will be cast to this dtype. Otherwise, value will be cast to the dtype of the existing parameter in the model.
  • fp16_statistics (torch.HalfTensor, optional) — The list of fp16 statistics to set on the module, used for 8 bit model serialization.
  • tied_params_map (Dict[int, Dict[torch.device, torch.Tensor]], optional, defaults to None) — A map of current data pointers to dictionaries of devices to already dispatched tied weights. For a given execution device, this parameter is useful to reuse the first available pointer of a shared weight on the device for all others, instead of duplicating memory.

A helper function to set a given tensor (parameter or buffer) of a module on a specific device (note that doing param.to(device) creates a new tensor not linked to the parameter, which is why we need this function).
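
Example (a minimal, self-contained sketch that materializes the parameters of a meta-initialized layer on CPU; assumes torch >= 2.0 for the meta-device context manager):

>>> import torch
>>> from accelerate.utils import set_module_tensor_to_device
>>> with torch.device("meta"):
...     layer = torch.nn.Linear(4, 4)
>>> set_module_tensor_to_device(layer, "weight", "cpu", value=torch.randn(4, 4))
>>> set_module_tensor_to_device(layer, "bias", "cpu", value=torch.zeros(4))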

accelerate.utils.get_module_children_bottom_up

< >

( model: Module ) → list[torch.nn.Module]

Parameters

  • model (torch.nn.Module) — The model to get the children of.

Returns

list[torch.nn.Module]

a list of children modules of model in bottom-up order. The last element is the model itself.

Traverse the model in bottom-up order and return the children modules in that order.
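
Example (a minimal sketch; model is assumed to be any torch.nn.Module):

>>> from accelerate.utils import get_module_children_bottom_up
>>> modules = get_module_children_bottom_up(model)
>>> modules[-1] is model  # the last element is the model itself
True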

Parallel

These include general utilities that should be used when working in parallel.

accelerate.utils.extract_model_from_parallel

< >

( model, keep_fp32_wrapper: bool = True, keep_torch_compile: bool = True, recursive: bool = False ) → torch.nn.Module

Parameters

  • model (torch.nn.Module) — The model to extract.
  • keep_fp32_wrapper (bool, optional, defaults to True) — Whether to keep the mixed precision (fp32 forward) hooks attached to the model; if False, they are removed.
  • keep_torch_compile (bool, optional, defaults to True) — Whether to keep the torch.compile wrapper on the returned model; if False, the compiled model is unwrapped.
  • recursive (bool, optional, defaults to False) — Whether to recursively extract all cases of module.module from model as well as unwrap child sublayers recursively, not just the top-level distributed containers.

Returns

torch.nn.Module

The extracted model.

Extract a model from its distributed containers.
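
Example (a minimal sketch, assuming model was wrapped, for instance by accelerator.prepare() or torch.nn.parallel.DistributedDataParallel):

>>> from accelerate.utils import extract_model_from_parallel
>>> unwrapped_model = extract_model_from_parallel(model)
>>> state_dict = unwrapped_model.state_dict()  # access the underlying module directly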

accelerate.utils.save

< >

( obj, f, save_on_each_node: bool = False, safe_serialization: bool = False )

Parameters

  • obj — The data to save.
  • f — The file (or file-like object) to use to save the data.
  • save_on_each_node (bool, optional, defaults to False) — Whether to save on the main process of each node, rather than only on the global main process.
  • safe_serialization (bool, optional, defaults to False) — Whether to save obj using safetensors or the traditional PyTorch way (that uses pickle).

Save the data to disk. Use in place of torch.save().
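
Example (a minimal sketch, assuming model is an instantiated torch.nn.Module; the file name is a placeholder):

>>> from accelerate.utils import save
>>> save(model.state_dict(), "model.safetensors", safe_serialization=True)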

accelerate.utils.load

< >

( f, map_location = None, **kwargs )

Parameters

  • f — The file (or file-like object) to use to load the data.
  • map_location — A function, torch.device, string or dict specifying how to remap storage locations.
  • **kwargs — Additional keyword arguments to pass to torch.load().

A compatible drop-in replacement for torch.load() that allows weights_only to be used if the torch version is 2.4.0 or higher; otherwise, the kwarg is ignored.

It will also add (and then remove) an exception for numpy arrays.
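
Example (a minimal sketch; the file name is a placeholder and weights_only only takes effect on torch 2.4.0 or higher, as noted above):

>>> from accelerate.utils import load
>>> state_dict = load("pytorch_model.bin", map_location="cpu", weights_only=True)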

accelerate.utils.wait_for_everyone

< >

( )

Introduces a blocking point in the script, making sure all processes have reached this point before continuing.

Make sure all processes reach this instruction; otherwise, one of your processes will hang forever.
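
Example (a hedged sketch of the typical pattern; accelerator is assumed to be an existing Accelerator instance):

>>> from accelerate.utils import wait_for_everyone
>>> if accelerator.is_main_process:
...     accelerator.save_state("checkpoint_dir")  # only the main process writes
>>> wait_for_everyone()  # every other process blocks here until the save is done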

Random

These utilities relate to setting and synchronizing of all the random states.

accelerate.utils.set_seed

< >

( seed: int, device_specific: bool = False, deterministic: bool = False )

Parameters

  • seed (int) — The seed to set.
  • device_specific (bool, optional, defaults to False) — Whether to slightly vary the seed on each device using self.process_index.
  • deterministic (bool, optional, defaults to False) — Whether to use deterministic algorithms where available. Can slow down training.

Helper function for reproducible behavior that sets the seed in random, numpy and torch.
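
Example (a minimal sketch):

>>> from accelerate.utils import set_seed
>>> set_seed(42)
>>> set_seed(42, device_specific=True)  # offsets the seed by the process index on each device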

accelerate.utils.synchronize_rng_state

< >

( rng_type: typing.Optional[accelerate.utils.dataclasses.RNGType] = None, generator: typing.Optional[torch._C.Generator] = None )

accelerate.synchronize_rng_states

< >

( rng_types: list, generator: typing.Optional[torch._C.Generator] = None )
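
Example (a hedged sketch; the strings correspond to RNGType values such as "torch" and "cuda", and the call is only meaningful inside a distributed launch):

>>> from accelerate.utils import synchronize_rng_states
>>> synchronize_rng_states(["torch", "cuda"])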

PyTorch XLA

These include utilities that are useful while using PyTorch with XLA.

accelerate.utils.install_xla

< >

( upgrade: bool = False )

Parameters

  • upgrade (bool, optional, defaults to False) — Whether to upgrade torch and install the latest torch_xla wheels.

Helper function to install appropriate xla wheels based on the torch version in Google Colaboratory.

Example:

>>> from accelerate.utils import install_xla

>>> install_xla(upgrade=True)

Loading model weights

These include utilities that are useful to load checkpoints.

accelerate.load_checkpoint_in_model

< >

( model: Module, checkpoint: typing.Union[str, os.PathLike], device_map: typing.Optional[dict[str, typing.Union[int, str, torch.device]]] = None, offload_folder: typing.Union[str, os.PathLike, NoneType] = None, dtype: typing.Union[str, torch.dtype, NoneType] = None, offload_state_dict: bool = False, offload_buffers: bool = False, keep_in_fp32_modules: list = None, offload_8bit_bnb: bool = False, strict: bool = False )

Parameters

  • model (torch.nn.Module) — The model in which we want to load a checkpoint.
  • checkpoint (str or os.PathLike) — The checkpoint (file or folder) to load. It can be:
    • a path to a file containing a whole model state dict
    • a path to a .json file containing the index to a sharded checkpoint
    • a path to a folder containing a unique .index.json file and the shards of a checkpoint.
    • a path to a folder containing a unique pytorch_model.bin or a model.safetensors file.
  • device_map (Dict[str, Union[int, str, torch.device]], optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name; once a given module name is in it, every submodule of that module will be sent to the same device.
  • offload_folder (str or os.PathLike, optional) — If the device_map contains any value "disk", the folder where we will offload weights.
  • dtype (str or torch.dtype, optional) — If provided, the weights will be converted to that type when loaded.
  • offload_state_dict (bool, optional, defaults to False) — If True, will temporarily offload the CPU state dict to the hard drive to avoid running out of CPU RAM if the size of the CPU state dict plus the biggest shard does not fit in memory.
  • offload_buffers (bool, optional, defaults to False) — Whether or not to include the buffers in the weights offloaded to disk.
  • keep_in_fp32_modules (List[str], optional) — A list of the modules that we keep in torch.float32 dtype.
  • offload_8bit_bnb (bool, optional) — Whether or not to enable offload of 8-bit modules on cpu/disk.
  • strict (bool, optional, defaults to False) — Whether to strictly enforce that the keys in the checkpoint state_dict match the keys of the model’s state_dict.

Loads a (potentially sharded) checkpoint inside a model, potentially sending weights to a given device as they are loaded.

Once loaded across devices, you still need to call dispatch_model() on your model to make it able to run. To group the checkpoint loading and dispatch in one single call, use load_checkpoint_and_dispatch().
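
Example (a hedged sketch of the full flow described above: empty initialization, checkpoint loading, then dispatch; MyModel and the checkpoint folder are placeholders):

>>> from accelerate import dispatch_model, init_empty_weights, load_checkpoint_in_model
>>> from accelerate.utils import infer_auto_device_map
>>> with init_empty_weights():
...     model = MyModel()  # hypothetical model class, instantiated without allocating memory
>>> device_map = infer_auto_device_map(model)
>>> load_checkpoint_in_model(model, "path/to/checkpoint_dir", device_map=device_map)
>>> model = dispatch_model(model, device_map=device_map)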

Quantization

These include utilities that are useful to quantize model.

accelerate.utils.load_and_quantize_model

< >

( model: Module, bnb_quantization_config: BnbQuantizationConfig, weights_location: typing.Union[str, os.PathLike] = None, device_map: typing.Optional[dict[str, typing.Union[int, str, torch.device]]] = None, no_split_module_classes: typing.Optional[list[str]] = None, max_memory: typing.Optional[dict[typing.Union[int, str], typing.Union[int, str]]] = None, offload_folder: typing.Union[str, os.PathLike, NoneType] = None, offload_state_dict: bool = False ) → torch.nn.Module

Parameters

  • model (torch.nn.Module) — The input model. It can be already loaded or on the meta device.
  • bnb_quantization_config (BnbQuantizationConfig) — The bitsandbytes quantization parameters.
  • weights_location (str or os.PathLike) — The location of the weights to load (file or folder). It can be:
    • a path to a file containing a whole model state dict
    • a path to a .json file containing the index to a sharded checkpoint
    • a path to a folder containing a unique .index.json file and the shards of a checkpoint.
    • a path to a folder containing a unique pytorch_model.bin file.
  • device_map (Dict[str, Union[int, str, torch.device]], optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name; once a given module name is in it, every submodule of that module will be sent to the same device.
  • no_split_module_classes (List[str], optional) — A list of layer class names that should never be split across device (for instance any layer that has a residual connection).
  • max_memory (Dict, optional) — A dictionary mapping device identifiers to maximum memory. Will default to the maximum memory available if unset.
  • offload_folder (str or os.PathLike, optional) — If the device_map contains any value "disk", the folder where we will offload weights.
  • offload_state_dict (bool, optional, defaults to False) — If True, will temporarily offload the CPU state dict to the hard drive to avoid running out of CPU RAM if the size of the CPU state dict plus the biggest shard does not fit in memory.

Returns

torch.nn.Module

The quantized model

This function will quantize the input model with the config passed in bnb_quantization_config. If the model is on the meta device, we will load and dispatch the weights according to the device_map passed. If the model is already loaded, we will quantize it and put it on the GPU.
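
Example (a hedged sketch of 8-bit quantization of an already loaded model; assumes a CUDA GPU and the bitsandbytes library are available, and model is a placeholder):

>>> from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
>>> bnb_config = BnbQuantizationConfig(load_in_8bit=True)
>>> quantized_model = load_and_quantize_model(model, bnb_quantization_config=bnb_config)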
