Transformers documentation
Custom Layers and Utilities
This page lists all the custom layers used by the library, as well as the utility functions and classes it provides for modeling.
Most of those are only useful if you are studying the code of the models in the library.
Layers
Base class for layers with gradient checkpointing.
This class enables gradient checkpointing functionality for a layer. By default, gradient checkpointing is disabled (gradient_checkpointing = False). When model.set_gradient_checkpointing() is called, gradient checkpointing is enabled by setting gradient_checkpointing = True and assigning a checkpointing function to _gradient_checkpointing_func.
Important: When using gradient checkpointing with use_reentrant=True, inputs that require gradients (e.g. hidden states) must be passed as positional arguments (*args) rather than keyword arguments to properly propagate gradients.
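As a minimal illustration of this constraint, the sketch below uses torch.utils.checkpoint directly; the layer_forward function is hypothetical and stands in for the computation that _gradient_checkpointing_func would wrap:

import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical layer computation standing in for a checkpointed layer's forward.
def layer_forward(hidden_states, scale=1.0):
    return hidden_states * scale

hidden_states = torch.randn(2, 4, 8, requires_grad=True)

# With use_reentrant=True the tensor that needs gradients is passed positionally;
# passing it as a keyword argument would not propagate gradients correctly.
output = checkpoint(layer_forward, hidden_states, use_reentrant=True)
output.sum().backward()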
Attention Functions
Dict-like object keeping track of allowed attention functions. You can easily add a new attention function with a call to register(). If a model needs to locally overwrite an existing attention function, say sdpa, it needs to declare a new instance of this class inside modeling_<model>.py and register the local version on that instance.
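A hedged sketch of the registration pattern (the function name and its fallback body below are illustrative, not part of the library API):

from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

# Hypothetical attention implementation; it mirrors the signature of the library's
# built-in attention functions (module, query, key, value, attention_mask, ...).
def my_custom_attention(module, query, key, value, attention_mask=None, **kwargs):
    # delegate to the stock sdpa implementation just to keep the sketch runnable
    return ALL_ATTENTION_FUNCTIONS["sdpa"](module, query, key, value, attention_mask, **kwargs)

ALL_ATTENTION_FUNCTIONS.register("my_custom_attention", my_custom_attention)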
Rotary Position Embedding Functions
transformers.dynamic_rope_update
< source >( rope_forward )
Decorator function to update the RoPE parameters in the forward pass, if the model is using a dynamic RoPE (i.e. a RoPE implementation that may recompute its frequencies in the forward pass).
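In the library this decorator is applied to the forward pass of rotary-embedding modules. A minimal sketch, assuming a hypothetical module (a real RoPE module also sets up the config, rope_type, and inverse-frequency buffers the decorator relies on):

from torch import nn
from transformers.modeling_rope_utils import dynamic_rope_update

class MyRotaryEmbedding(nn.Module):
    # Hypothetical RoPE module; the decorator refreshes the frequencies before the
    # forward call whenever the configured RoPE type is dynamic.
    @dynamic_rope_update
    def forward(self, x, position_ids):
        ...  # compute cos/sin from self.inv_freq and position_ids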
PyTorch custom modules
class transformers.Conv1D
< source >( nf nx )
1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).
Basically works like a linear layer but the weights are transposed.
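A short usage sketch with illustrative shapes:

import torch
from transformers.pytorch_utils import Conv1D

layer = Conv1D(nf=2304, nx=768)          # projects 768 input features to 2304 output features
hidden_states = torch.randn(2, 10, 768)
output = layer(hidden_states)            # shape (2, 10, 2304)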
PyTorch Helper Functions
transformers.apply_chunking_to_forward
< source >( forward_fn: Callable[..., torch.Tensor] chunk_size: int chunk_dim: int *input_tensors ) → torch.Tensor
Parameters
- forward_fn (Callable[..., torch.Tensor]) — The forward function of the model.
- chunk_size (int) — The chunk size of a chunked tensor: num_chunks = len(input_tensors[0]) / chunk_size.
- chunk_dim (int) — The dimension over which the input_tensors should be chunked.
- input_tensors (Tuple[torch.Tensor]) — The input tensors of forward_fn which will be chunked.
Returns
torch.Tensor
A tensor with the same shape as the one forward_fn would have produced if applied to the full input.
This function chunks the input_tensors into smaller input tensor parts of size chunk_size over the dimension chunk_dim. It then applies a layer forward_fn to each chunk independently to save memory.
If the forward_fn is independent across the chunk_dim this function will yield the same result as directly applying forward_fn to input_tensors.
Examples:
from transformers.pytorch_utils import apply_chunking_to_forward

# rename the usual forward() fn to forward_chunk()
def forward_chunk(self, hidden_states):
    hidden_states = self.decoder(hidden_states)
    return hidden_states

# implement a chunked forward function
def forward(self, hidden_states):
    return apply_chunking_to_forward(self.forward_chunk, self.chunk_size_lm_head, self.seq_len_dim, hidden_states)
transformers.pytorch_utils.find_pruneable_heads_and_indices
< source >( heads: list[int] n_heads: int head_size: int already_pruned_heads: set[int] ) → Tuple[Set[int], torch.LongTensor]
Parameters
- heads (List[int]) — List of the indices of heads to prune.
- n_heads (int) — The number of heads in the model.
- head_size (int) — The size of each head.
- already_pruned_heads (Set[int]) — A set of already pruned heads.
Returns
Tuple[Set[int], torch.LongTensor]
A tuple with the indices of heads to prune taking already_pruned_heads into account and the indices of rows/columns to keep in the layer weight.
Finds the heads and their indices taking already_pruned_heads into account.
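A short usage sketch with illustrative numbers:

import torch
from transformers.pytorch_utils import find_pruneable_heads_and_indices

# Prune heads 0 and 2 out of 12 heads of size 64, with no heads pruned so far.
heads, index = find_pruneable_heads_and_indices(
    heads=[0, 2], n_heads=12, head_size=64, already_pruned_heads=set()
)
# heads == {0, 2}; index holds the 10 * 64 row/column indices to keep in the weight.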
transformers.prune_layer
< source >( layer: nn.Linear | Conv1D index: torch.LongTensor dim: int | None = None ) → torch.nn.Linear or Conv1D
Parameters
- layer (Union[torch.nn.Linear, Conv1D]) — The layer to prune.
- index (torch.LongTensor) — The indices to keep in the layer.
- dim (int, optional) — The dimension on which to keep the indices.
Returns
torch.nn.Linear or Conv1D
The pruned layer as a new layer with requires_grad=True.
Prune a Conv1D or linear layer to keep only entries in index.
Used to remove heads.
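A short usage sketch with illustrative sizes:

import torch
from torch import nn
from transformers.pytorch_utils import prune_layer

layer = nn.Linear(768, 768)
index = torch.arange(512, dtype=torch.long)   # keep the first 512 output rows
pruned = prune_layer(layer, index, dim=0)     # new nn.Linear(768, 512) with requires_grad=True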
transformers.pytorch_utils.prune_conv1d_layer
< source >( layer: Conv1D index: torch.LongTensor dim: int = 1 ) → Conv1D
Prune a Conv1D layer to keep only entries in index. A Conv1D works like a Linear layer (see e.g. BERT) but the weights are transposed.
Used to remove heads.
transformers.pytorch_utils.prune_linear_layer
< source >( layer: nn.Linear index: torch.LongTensor dim: int = 0 ) → torch.nn.Linear
Prune a linear layer to keep only entries in index.
Used to remove heads.
TensorFlow custom layers
class transformers.modeling_tf_utils.TFConv1D
< source >( nf nx initializer_range = 0.02 **kwargs )
Parameters
- nf (int) — The number of output features.
- nx (int) — The number of input features.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation to use to initialize the weights.
- kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to the __init__ of keras.layers.Layer.
1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).
Basically works like a linear layer but the weights are transposed.
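A short usage sketch mirroring the PyTorch Conv1D above (shapes are illustrative):

import tensorflow as tf
from transformers.modeling_tf_utils import TFConv1D

layer = TFConv1D(nf=2304, nx=768)              # projects 768 input features to 2304 output features
hidden_states = tf.random.normal((2, 10, 768))
output = layer(hidden_states)                  # shape (2, 10, 2304)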
class transformers.TFSequenceSummary
< source >( config: PretrainedConfig initializer_range: float = 0.02 **kwargs )
Parameters
- config (PretrainedConfig) — The config used by the model. Relevant arguments in the config class of the model are (refer to the actual config class of your model for the default values it uses):
  - summary_type (str) — The method to use to make this summary. Accepted values are:
    - "last" — Take the last token hidden state (like XLNet)
    - "first" — Take the first token hidden state (like Bert)
    - "mean" — Take the mean of all tokens hidden states
    - "cls_index" — Supply a Tensor of classification token position (GPT/GPT-2)
    - "attn" — Not implemented now, use multi-head attention
  - summary_use_proj (bool) — Add a projection after the vector extraction.
  - summary_proj_to_labels (bool) — If True, the projection outputs to config.num_labels classes (otherwise to config.hidden_size).
  - summary_activation (Optional[str]) — Set to "tanh" to add a tanh activation to the output, another string or None will add no activation.
  - summary_first_dropout (float) — Optional dropout probability before the projection and activation.
  - summary_last_dropout (float) — Optional dropout probability after the projection and activation.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation to use to initialize the weights.
- kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to the __init__ of keras.layers.Layer.
Compute a single vector summary of a sequence's hidden states.
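A hedged usage sketch; GPT2Config is used here only because it defines the summary_* attributes listed above, and the width of the output depends on summary_proj_to_labels and config.num_labels:

import tensorflow as tf
from transformers import GPT2Config
from transformers.modeling_tf_utils import TFSequenceSummary

config = GPT2Config()                                   # provides summary_type, summary_use_proj, ...
summary = TFSequenceSummary(config, initializer_range=config.initializer_range)
hidden_states = tf.random.normal((2, 10, config.hidden_size))
pooled = summary(hidden_states)                         # one summary vector per sequence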
TensorFlow loss functions
Loss function suitable for causal language modeling (CLM), that is, the task of guessing the next token.
Any label of -100 will be ignored (along with the corresponding logits) in the loss computation.
Loss function suitable for masked language modeling (MLM), that is, the task of guessing the masked tokens.
Any label of -100 will be ignored (along with the corresponding logits) in the loss computation.
Loss function suitable for multiple choice tasks.
Loss function suitable for question answering.
Loss function suitable for sequence classification.
Loss function suitable for token classification.
Any label of -100 will be ignored (along with the corresponding logits) in the loss computation.
TensorFlow Helper Functions
transformers.modeling_tf_utils.get_initializer
< source >( initializer_range: float = 0.02 ) → keras.initializers.TruncatedNormal
Creates a keras.initializers.TruncatedNormal with the given range.
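A short usage sketch:

import tensorflow as tf
from transformers.modeling_tf_utils import get_initializer

initializer = get_initializer(initializer_range=0.02)
dense = tf.keras.layers.Dense(768, kernel_initializer=initializer)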
transformers.modeling_tf_utils.keras_serializable
< source >( )
Decorate a Keras Layer class to support Keras serialization.
This is done by:
- Adding a transformers_config dict to the Keras config dictionary in get_config (called by Keras at serialization time).
- Wrapping __init__ to accept that transformers_config dict (passed by Keras at deserialization time) and convert it to a config object for the actual layer initializer.
- Registering the class as a custom object in Keras (if the Tensorflow version supports this), so that it does not need to be supplied in custom_objects in the call to keras.models.load_model.
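A hedged sketch of a custom layer using the decorator (the layer itself is hypothetical; the decorator expects a config_class attribute and an __init__ whose first argument is the config):

import tensorflow as tf
from transformers import BertConfig
from transformers.modeling_tf_utils import keras_serializable

@keras_serializable
class MyCustomTFLayer(tf.keras.layers.Layer):
    config_class = BertConfig  # required by keras_serializable

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.config = config
        self.dense = tf.keras.layers.Dense(config.hidden_size)

    def call(self, inputs):
        return self.dense(inputs)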
transformers.shape_list
< source >( tensor: typing.Union[tensorflow.python.framework.tensor.Tensor, numpy.ndarray] ) → List[int]
Deal with dynamic shapes in TensorFlow cleanly.
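A short usage sketch:

import tensorflow as tf
from transformers.tf_utils import shape_list

hidden_states = tf.random.normal((2, 10, 768))
print(shape_list(hidden_states))  # [2, 10, 768], static dimensions as Python ints
# Inside a tf.function with unknown dimensions, those entries come back as scalar tensors instead of None.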