ModernBERT
ModernBERT is a modernized version of BERT trained on 2T tokens. It brings many improvements to the original architecture, such as rotary positional embeddings (RoPE) to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention.
You can find all the original ModernBERT checkpoints under the ModernBERT collection.
Click on the ModernBERT models in the right sidebar for more examples of how to apply ModernBERT to different language tasks.
The examples below demonstrate how to predict the [MASK] token, first with Pipeline and then with AutoModel.
import torch
from transformers import pipeline
pipeline = pipeline(
    task="fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.float16,
    device=0
)
pipeline("Plants create [MASK] through a process known as photosynthesis.")
ModernBertConfig
class transformers.ModernBertConfig
< source >( vocab_size = 50368 hidden_size = 768 intermediate_size = 1152 num_hidden_layers = 22 num_attention_heads = 12 hidden_activation = 'gelu' max_position_embeddings = 8192 initializer_range = 0.02 initializer_cutoff_factor = 2.0 norm_eps = 1e-05 norm_bias = False pad_token_id = 50283 eos_token_id = 50282 bos_token_id = 50281 cls_token_id = 50281 sep_token_id = 50282 global_rope_theta = 160000.0 attention_bias = False attention_dropout = 0.0 global_attn_every_n_layers = 3 local_attention = 128 local_rope_theta = 10000.0 embedding_dropout = 0.0 mlp_bias = False mlp_dropout = 0.0 decoder_bias = True classifier_pooling: typing.Literal['cls', 'mean'] = 'cls' classifier_dropout = 0.0 classifier_bias = False classifier_activation = 'gelu' deterministic_flash_attn = False sparse_prediction = False sparse_pred_ignore_index = -100 reference_compile = None repad_logits_with_grad = False **kwargs )
Parameters
- vocab_size (`int`, optional, defaults to 50368) — Vocabulary size of the ModernBert model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling ModernBertModel.
- hidden_size (`int`, optional, defaults to 768) — Dimension of the hidden representations.
- intermediate_size (`int`, optional, defaults to 1152) — Dimension of the MLP representations.
- num_hidden_layers (`int`, optional, defaults to 22) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (`int`, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
- hidden_activation (`str` or `function`, optional, defaults to `"gelu"`) — The non-linear activation function (function or string) in the encoder. Will default to `"gelu"` if not specified.
- max_position_embeddings (`int`, optional, defaults to 8192) — The maximum sequence length that this model might ever be used with.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- initializer_cutoff_factor (`float`, optional, defaults to 2.0) — The cutoff factor for the truncated_normal_initializer for initializing all weight matrices.
- norm_eps (`float`, optional, defaults to 1e-05) — The epsilon used by the RMS normalization layers.
- norm_bias (`bool`, optional, defaults to `False`) — Whether to use bias in the normalization layers.
- pad_token_id (`int`, optional, defaults to 50283) — Padding token id.
- eos_token_id (`int`, optional, defaults to 50282) — End of stream token id.
- bos_token_id (`int`, optional, defaults to 50281) — Beginning of stream token id.
- cls_token_id (`int`, optional, defaults to 50281) — Classification token id.
- sep_token_id (`int`, optional, defaults to 50282) — Separation token id.
- global_rope_theta (`float`, optional, defaults to 160000.0) — The base period of the global RoPE embeddings.
- attention_bias (`bool`, optional, defaults to `False`) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- attention_dropout (`float`, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- global_attn_every_n_layers (`int`, optional, defaults to 3) — The number of layers between global attention layers.
- local_attention (`int`, optional, defaults to 128) — The window size for local attention.
- local_rope_theta (`float`, optional, defaults to 10000.0) — The base period of the local RoPE embeddings.
- embedding_dropout (`float`, optional, defaults to 0.0) — The dropout ratio for the embeddings.
- mlp_bias (`bool`, optional, defaults to `False`) — Whether to use bias in the MLP layers.
- mlp_dropout (`float`, optional, defaults to 0.0) — The dropout ratio for the MLP layers.
- decoder_bias (`bool`, optional, defaults to `True`) — Whether to use bias in the decoder layers.
- classifier_pooling (`str`, optional, defaults to `"cls"`) — The pooling method for the classifier. Should be either `"cls"` or `"mean"`. In local attention layers, the CLS token doesn't attend to all tokens on long sequences.
- classifier_dropout (`float`, optional, defaults to 0.0) — The dropout ratio for the classifier.
- classifier_bias (`bool`, optional, defaults to `False`) — Whether to use bias in the classifier.
- classifier_activation (`str`, optional, defaults to `"gelu"`) — The activation function for the classifier.
- deterministic_flash_attn (`bool`, optional, defaults to `False`) — Whether to use deterministic Flash Attention. If `False`, inference will be faster but not deterministic.
- sparse_prediction (`bool`, optional, defaults to `False`) — Whether to use sparse prediction for the masked language model instead of returning the full dense logits.
- sparse_pred_ignore_index (`int`, optional, defaults to -100) — The index to ignore for the sparse prediction.
- reference_compile (`bool`, optional) — Whether to compile the layers of the model which were compiled during pretraining. If `None`, then parts of the model will be compiled if 1) `triton` is installed, 2) the model is not on MPS, 3) the model is not shared between devices, and 4) the model is not resized after initialization. If `True`, then the model may be faster in some scenarios.
- repad_logits_with_grad (`bool`, optional, defaults to `False`) — When `True`, ModernBertForMaskedLM keeps track of the logits' gradient when repadding for output. This only applies when using Flash Attention 2 with passed labels. Otherwise, output logits always have a gradient.
This is the configuration class to store the configuration of a ModernBertModel. It is used to instantiate a ModernBert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of ModernBERT-base, e.g. answerdotai/ModernBERT-base.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Examples:
>>> from transformers import ModernBertModel, ModernBertConfig
>>> # Initializing a ModernBert style configuration
>>> configuration = ModernBertConfig()
>>> # Initializing a model from the modernbert-base style configuration
>>> model = ModernBertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
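The ModernBERT-specific settings listed above (attention layout, local window size, classifier pooling) can also be overridden when building a configuration from scratch. The values below are arbitrary, shown for illustration only, and do not correspond to any released checkpoint.
>>> # Hypothetical overrides, for illustration only
>>> custom_configuration = ModernBertConfig(
...     global_attn_every_n_layers=4,  # a global-attention layer every 4 layers
...     local_attention=256,  # wider sliding window in the local-attention layers
...     classifier_pooling="mean",  # pool with a masked mean instead of the CLS token
... )
>>> custom_model = ModernBertModel(custom_configuration)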
ModernBertModel
class transformers.ModernBertModel
< source >( config: ModernBertConfig )
Parameters
- config (ModernBertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare ModernBert Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None sliding_window_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.Tensor] = None indices: typing.Optional[torch.Tensor] = None cu_seqlens: typing.Optional[torch.Tensor] = None max_seqlen: typing.Optional[int] = None batch_size: typing.Optional[int] = None seq_len: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. With Flash Attention 2.0, padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- sliding_window_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding or far-away tokens. In ModernBert, only every few layers perform global attention, while the rest perform local attention. This mask is used to avoid attending to far-away tokens in the local attention layers when not using Flash Attention.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- indices (`torch.Tensor` of shape `(total_unpadded_tokens,)`, optional) — Indices of the non-padding tokens in the input sequence. Used for unpadding the output.
- cu_seqlens (`torch.Tensor` of shape `(batch + 1,)`, optional) — Cumulative sequence lengths of the input sequences. Used to index the unpadded tensors.
- max_seqlen (`int`, optional) — Maximum sequence length in the batch excluding padding tokens. Used to unpad `input_ids` and pad output tensors.
- batch_size (`int`, optional) — Batch size of the input sequences. Used to pad the output tensors.
- seq_len (`int`, optional) — Sequence length of the input sequences including padding tokens. Used to pad the output tensors.
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (ModernBertConfig) and inputs.
- last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ModernBertModel forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, ModernBertModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
>>> model = ModernBertModel.from_pretrained("answerdotai/ModernBERT-base")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
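Building on the example above, the last hidden state can be reduced to a single sentence vector with an attention-mask weighted mean. This pooling snippet is an illustrative sketch, not part of the model API.
>>> # masked mean pooling over the sequence dimension (illustrative sketch)
>>> mask = inputs.attention_mask.unsqueeze(-1).float()
>>> sentence_embedding = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)  # shape (1, hidden_size)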
ModernBertForMaskedLM
class transformers.ModernBertForMaskedLM
< source >( config: ModernBertConfig )
Parameters
- config (ModernBertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The ModernBert Model with a decoder head on top that is used for masked language modeling. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None sliding_window_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None inputs_embeds: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None indices: typing.Optional[torch.Tensor] = None cu_seqlens: typing.Optional[torch.Tensor] = None max_seqlen: typing.Optional[int] = None batch_size: typing.Optional[int] = None seq_len: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs ) → transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. With Flash Attention 2.0, padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- sliding_window_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding or far-away tokens. In ModernBert, only every few layers perform global attention, while the rest perform local attention. This mask is used to avoid attending to far-away tokens in the local attention layers when not using Flash Attention.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- indices (`torch.Tensor` of shape `(total_unpadded_tokens,)`, optional) — Indices of the non-padding tokens in the input sequence. Used for unpadding the output.
- cu_seqlens (`torch.Tensor` of shape `(batch + 1,)`, optional) — Cumulative sequence lengths of the input sequences. Used to index the unpadded tensors.
- max_seqlen (`int`, optional) — Maximum sequence length in the batch excluding padding tokens. Used to unpad `input_ids` and pad output tensors.
- batch_size (`int`, optional) — Batch size of the input sequences. Used to pad the output tensors.
- seq_len (`int`, optional) — Sequence length of the input sequences including padding tokens. Used to pad the output tensors.
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.MaskedLMOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (ModernBertConfig) and inputs.
- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Masked language modeling (MLM) loss.
- logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ModernBertForMaskedLM forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, ModernBertForMaskedLM
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
>>> model = ModernBertForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> # retrieve index of [MASK]
>>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
>>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
>>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
>>> # mask labels of non-[MASK] tokens
>>> labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)
>>> outputs = model(**inputs, labels=labels)
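To make the result readable, the predicted id can be decoded back to text and the loss inspected. The decoded token is expected, though not guaranteed, to be "Paris" for this checkpoint.
>>> tokenizer.decode(predicted_token_id)  # decoded prediction for the [MASK] position
>>> round(outputs.loss.item(), 2)  # masked language modeling loss on the labeled batch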
ModernBertForSequenceClassification
class transformers.ModernBertForSequenceClassification
< source >( config: ModernBertConfig )
Parameters
- config (ModernBertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The ModernBert Model with a sequence classification head on top that performs pooling. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None sliding_window_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None inputs_embeds: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None indices: typing.Optional[torch.Tensor] = None cu_seqlens: typing.Optional[torch.Tensor] = None max_seqlen: typing.Optional[int] = None batch_size: typing.Optional[int] = None seq_len: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. With Flash Attention 2.0, padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- sliding_window_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding or far-away tokens. In ModernBert, only every few layers perform global attention, while the rest perform local attention. This mask is used to avoid attending to far-away tokens in the local attention layers when not using Flash Attention.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- indices (`torch.Tensor` of shape `(total_unpadded_tokens,)`, optional) — Indices of the non-padding tokens in the input sequence. Used for unpadding the output.
- cu_seqlens (`torch.Tensor` of shape `(batch + 1,)`, optional) — Cumulative sequence lengths of the input sequences. Used to index the unpadded tensors.
- max_seqlen (`int`, optional) — Maximum sequence length in the batch excluding padding tokens. Used to unpad `input_ids` and pad output tensors.
- batch_size (`int`, optional) — Batch size of the input sequences. Used to pad the output tensors.
- seq_len (`int`, optional) — Sequence length of the input sequences including padding tokens. Used to pad the output tensors.
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (`torch.LongTensor` of shape `(batch_size,)`, optional) — Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), if `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (ModernBertConfig) and inputs.
- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ModernBertForSequenceClassification forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example of single-label classification:
>>> import torch
>>> from transformers import AutoTokenizer, ModernBertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
>>> model = ModernBertForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = ModernBertForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=num_labels)
>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
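As a follow-up to the single-label example, the predicted id can be mapped to a label name through the config. Note that for a checkpoint without a fine-tuned classification head these names are placeholders such as LABEL_0.
>>> model.config.id2label[predicted_class_id]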
Example of multi-label classification:
>>> import torch
>>> from transformers import AutoTokenizer, ModernBertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
>>> model = ModernBertForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", problem_type="multi_label_classification")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = ModernBertForSequenceClassification.from_pretrained(
... "answerdotai/ModernBERT-base", num_labels=num_labels, problem_type="multi_label_classification"
... )
>>> labels = torch.sum(
... torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss
ModernBertForTokenClassification
class transformers.ModernBertForTokenClassification
< source >( config: ModernBertConfig )
Parameters
- config (ModernBertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The ModernBert Model with a token classification head on top, e.g. for Named Entity Recognition (NER) tasks. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None sliding_window_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None inputs_embeds: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None indices: typing.Optional[torch.Tensor] = None cu_seqlens: typing.Optional[torch.Tensor] = None max_seqlen: typing.Optional[int] = None batch_size: typing.Optional[int] = None seq_len: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. With Flash Attention 2.0, padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- sliding_window_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding or far-away tokens. In ModernBert, only every few layers perform global attention, while the rest perform local attention. This mask is used to avoid attending to far-away tokens in the local attention layers when not using Flash Attention.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- indices (`torch.Tensor` of shape `(total_unpadded_tokens,)`, optional) — Indices of the non-padding tokens in the input sequence. Used for unpadding the output.
- cu_seqlens (`torch.Tensor` of shape `(batch + 1,)`, optional) — Cumulative sequence lengths of the input sequences. Used to index the unpadded tensors.
- max_seqlen (`int`, optional) — Maximum sequence length in the batch excluding padding tokens. Used to unpad `input_ids` and pad output tensors.
- batch_size (`int`, optional) — Batch size of the input sequences. Used to pad the output tensors.
- seq_len (`int`, optional) — Sequence length of the input sequences including padding tokens. Used to pad the output tensors.
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
Returns
transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.TokenClassifierOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (ModernBertConfig) and inputs.
- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Classification loss.
- logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`) — Classification scores (before SoftMax).
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ModernBertForTokenClassification forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, ModernBertForTokenClassification
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
>>> model = ModernBertForTokenClassification.from_pretrained("answerdotai/ModernBERT-base")
>>> inputs = tokenizer(
... "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
... )
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_token_class_ids = logits.argmax(-1)
>>> # Note that tokens are classified rather than input words, which means that
>>> # there might be more predicted token classes than words.
>>> # Multiple token classes might account for the same word
>>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
>>> labels = predicted_token_class_ids
>>> loss = model(**inputs, labels=labels).loss
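To see which sub-word token each prediction belongs to, the input ids can be converted back to tokens and zipped with the predicted classes. This pairing snippet is an illustrative sketch, not part of the model API.
>>> # pair each sub-word token with its predicted class (illustrative sketch)
>>> tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
>>> list(zip(tokens, predicted_tokens_classes))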
ModernBertForQuestionAnswering
class transformers.ModernBertForQuestionAnswering
< source >( config: ModernBertConfig )
Parameters
- config (ModernBertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The ModernBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.Tensor] attention_mask: typing.Optional[torch.Tensor] = None sliding_window_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None start_positions: typing.Optional[torch.Tensor] = None end_positions: typing.Optional[torch.Tensor] = None indices: typing.Optional[torch.Tensor] = None cu_seqlens: typing.Optional[torch.Tensor] = None max_seqlen: typing.Optional[int] = None batch_size: typing.Optional[int] = None seq_len: typing.Optional[int] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs ) → transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. With Flash Attention 2.0, padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- sliding_window_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding or far-away tokens. In ModernBert, only every few layers perform global attention, while the rest perform local attention. This mask is used to avoid attending to far-away tokens in the local attention layers when not using Flash Attention.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- indices (`torch.Tensor` of shape `(total_unpadded_tokens,)`, optional) — Indices of the non-padding tokens in the input sequence. Used for unpadding the output.
- cu_seqlens (`torch.Tensor` of shape `(batch + 1,)`, optional) — Cumulative sequence lengths of the input sequences. Used to index the unpadded tensors.
- max_seqlen (`int`, optional) — Maximum sequence length in the batch excluding padding tokens. Used to unpad `input_ids` and pad output tensors.
- batch_size (`int`, optional) — Batch size of the input sequences. Used to pad the output tensors.
- seq_len (`int`, optional) — Sequence length of the input sequences including padding tokens. Used to pad the output tensors.
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (ModernBertConfig) and inputs.
- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
- start_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`) — Span-start scores (before SoftMax).
- end_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`) — Span-end scores (before SoftMax).
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ModernBertForQuestionAnswering forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, ModernBertForQuestionAnswering
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
>>> model = ModernBertForQuestionAnswering.from_pretrained("answerdotai/ModernBERT-base")
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()
>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> # target is "nice puppet"
>>> target_start_index = torch.tensor([14])
>>> target_end_index = torch.tensor([15])
>>> outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)
>>> loss = outputs.loss
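The selected span can be decoded back into text. Note that answerdotai/ModernBERT-base does not ship a fine-tuned question-answering head, so the decoded span is essentially arbitrary until the model is fine-tuned.
>>> tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)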
Usage tips
The ModernBert model can be fine-tuned for question-answering tasks with the Hugging Face Transformers library and its official example script.
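For a rough picture of what a fine-tuning step looks like in code, here is a minimal, hedged sketch that runs a single optimizer step on the toy example from the question-answering section above; the official example script additionally handles dataset preprocessing, evaluation, and learning-rate scheduling.
import torch
from transformers import AutoTokenizer, ModernBertForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = ModernBertForQuestionAnswering.from_pretrained("answerdotai/ModernBERT-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

question, context = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, context, return_tensors="pt")

model.train()
outputs = model(
    **inputs,
    start_positions=torch.tensor([14]),  # toy labels reused from the example above
    end_positions=torch.tensor([15]),
)
outputs.loss.backward()  # one gradient step on a single toy example
optimizer.step()
optimizer.zero_grad()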