Transformers documentation
Csm
Overview
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
Model Architecture: CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.

Usage Tips
Without Conversational Context
CSM can be used to generate speech directly from a text prompt:
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
text = "[0]The past is just a story we tell ourselves." # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)
# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "The past is just a story we tell ourselves."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
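Generation can also be tuned with sampling parameters. As a hedged sketch (reusing model, inputs, and processor from above; the exact set of supported kwargs may evolve), depth-decoder-specific keyword arguments are passed with a depth_decoder_ prefix:
# sample instead of greedy decoding, for both the backbone and the depth decoder
audio = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    depth_decoder_do_sample=True,
    depth_decoder_temperature=0.9,
    output_audio=True,
)
processor.save_audio(audio, "example_sampling.wav")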
With Conversational Context
CSM can generate speech conditioned on a conversation, which keeps voices consistent across turns and makes generation context-aware:
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []
# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )
# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
Batched Inference
CSM supports batched inference!
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# here a batch with two prompts
conversation = [
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
                {"type": "audio", "path": ds[0]["audio"]["array"]},
            ],
        },
        {
            "role": f"{ds[1]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[1]["text"]},
            ],
        },
    ],
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
Making The Model Go Brrr
CSM supports full-graph compilation with CUDA graphs!
import torch
import copy
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset
model_id = "eustlb/csm-1b"
device = "cuda"
# set logs to ensure no recompilation and graph breaks
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# use a static cache, which automatically enables torch.compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250 # big enough to avoid recompilation
model.generation_config.max_new_tokens = None # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"
# generation kwargs
gen_kwargs = {
    "do_sample": False,
    "depth_decoder_do_sample": False,
    "temperature": 1.0,
    "depth_decoder_temperature": 1.0,
}
# Define a timing context manager
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")
# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[1]["text"]},
            {"type": "audio", "path": ds[1]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
        ],
    },
]
padded_inputs_1 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
print("\n" + "="*50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
_ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)
print("\n" + "="*50)
print("Second generation - fast !!!")
with TimerContext("Second generation"):
_ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)
# now with different inputs
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
            {"type": "audio", "path": ds[2]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[3]["text"]},
            {"type": "audio", "path": ds[3]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[4]["text"]},
        ],
    },
]
padded_inputs_2 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
print("\n" + "="*50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
_ = model.generate(**padded_inputs_2, **gen_kwargs)
print("="*50)
Training
The CSM Transformers integration supports training!
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
model_id = "eustlb/csm-1b"
device = "cuda"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model.train()
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []
# context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    output_labels=True,
).to(device)
out = model(**inputs)
out.loss.backward()
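The snippet above performs a single forward and backward pass. A minimal sketch of a full optimization step follows, assuming you reuse model and inputs from above; a real setup would batch several conversations and use a proper training loop or the Trainer API.
import torch

# plain AdamW step on the loss computed above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

out = model(**inputs)
out.loss.backward()
optimizer.step()
optimizer.zero_grad()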
This model was contributed by Eustache Le Bihan. The original code can be found here.
CsmConfig
class transformers.CsmConfig
< source >( num_codebooks = 32 vocab_size = 2051 text_vocab_size = 128256 hidden_size = 2048 intermediate_size = 8192 num_hidden_layers = 16 num_attention_heads = 32 num_key_value_heads = 8 hidden_act = 'silu' max_position_embeddings = 2048 initializer_range = 0.02 rms_norm_eps = 1e-05 use_cache = True pad_token_id = 128002 codebook_pad_token_id = 2050 codebook_eos_token_id = 0 bos_token_id = 128000 eos_token_id = None audio_token_id = 128002 audio_eos_token_id = 128003 rope_theta = 500000 rope_scaling = None attention_bias = False attention_dropout = 0.0 mlp_bias = False head_dim = None tie_codebooks_embeddings = True depth_decoder_config = None codec_config = None **kwargs )
Parameters
- num_codebooks (
int
, optional, defaults to 32) — Number of codebooks used in the underlying codec model responsible for tokenizing the audio. - vocab_size (
int
, optional, defaults to 2051) — Vocabulary size of the Csm model. Defines the number of different audio tokens that can be represented by each codebook. - text_vocab_size (
int
, optional, defaults to 128256) — Vocabulary size of the text input for the Csm model. Defines the number of different text tokens that can be represented. - hidden_size (
int
, optional, defaults to 2048) — Dimension of the hidden representations of the backbone model. - intermediate_size (
int
, optional, defaults to 8192) — Dimension of the MLP representations of the backbone model. - num_hidden_layers (
int
, optional, defaults to 16) — Number of hidden layers in the backbone model Transformer decoder. - num_attention_heads (
int
, optional, defaults to 32) — Number of attention heads for each attention layer in the backbone model Transformer decoder. - num_key_value_heads (
int
, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. - hidden_act (
str
orfunction
, optional, defaults to"silu"
) — The non-linear activation function (function or string) in the backbone model Transformer decoder. - max_position_embeddings (
int
, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. - initializer_range (
float
, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - rms_norm_eps (
float
, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers. - use_cache (
bool
, optional, defaults toTrue
) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant ifconfig.is_decoder=True
. - pad_token_id (
int
, optional, defaults to 128002) — Padding token id. - codebook_pad_token_id (
int
, optional, defaults to 2050) — Padding token id for codebook tokens. - codebook_eos_token_id (
int
, optional, defaults to 0) — End of stream token id for codebook tokens. - bos_token_id (
int
, optional, defaults to 128000) — Beginning of stream token id. - eos_token_id (
int
, optional) — End of stream token id. - audio_token_id (
int
, optional, defaults to 128002) — Audio token id in the text input. - audio_eos_token_id (
int
, optional, defaults to 128003) — End of stream token id for audio in the text input. - rope_theta (
float
, optional, defaults to 500000) — The base period of the RoPE embeddings. - rope_scaling (
Dict
, optional, defaults to {'factor': 32.0, 'high_freq_factor': 0.5, 'low_freq_factor': 0.125, 'original_max_position_embeddings': 1024, 'rope_type': 'llama3'}
): Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type and you expect the model to work on longermax_position_embeddings
, we recommend you to update this value accordingly. Expected contents:rope_type
(str
): The sub-variant of RoPE to use. Can be one of [‘default’, ‘linear’, ‘dynamic’, ‘yarn’, ‘longrope’, ‘llama3’], with ‘default’ being the original RoPE implementation.factor
(float
, optional): Used with all rope types except ‘default’. The scaling factor to apply to the RoPE embeddings. In most scaling types, afactor
of x will enable the model to handle sequences of length x original maximum pre-trained length.original_max_position_embeddings
(int
, optional): Used with ‘dynamic’, ‘longrope’ and ‘llama3’. The original max position embeddings used during pretraining.attention_factor
(float
, optional): Used with ‘yarn’ and ‘longrope’. The scaling factor to be applied on the attention computation. If unspecified, it defaults to value recommended by the implementation, using thefactor
field to infer the suggested value.beta_fast
(float
, optional): Only used with ‘yarn’. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.beta_slow
(float
, optional): Only used with ‘yarn’. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.short_factor
(List[float]
, optional): Only used with ‘longrope’. The scaling factor to be applied to short contexts (<original_max_position_embeddings
). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2long_factor
(List[float]
, optional): Only used with ‘longrope’. The scaling factor to be applied to long contexts (<original_max_position_embeddings
). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2low_freq_factor
(float
, optional): Only used with ‘llama3’. Scaling factor applied to low frequency components of the RoPEhigh_freq_factor
(float
, optional*): Only used with ‘llama3’. Scaling factor applied to high frequency components of the RoPE - attention_bias (
bool
, optional, defaults toFalse
) — Whether to use a bias in the query, key, value and output projection layers during self-attention. - attention_dropout (
float
, optional, defaults to 0.0) — The dropout ratio for the attention probabilities. - mlp_bias (
bool
, optional, defaults toFalse
) — Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers. - head_dim (
int
, optional) — The attention head dimension. If None, it will default to hidden_size // num_attention_heads - tie_codebooks_embeddings (
bool
, optional, defaults toTrue
) — Whether to tie the codebook tokens embeddings of the backbone model to the codebook tokens embeddings of the depth decoder. - depth_decoder_config (
CsmDepthDecoderConfig
, optional) — Configuration for the depth decoder. - codec_config (
PretrainedConfig
, optional) — Configuration for the codec.
This is the configuration class to store the configuration of a CsmForConditionalGeneration. It is used to instantiate a CSM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of csm-1b.
e.g. eustlb/csm-1b
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
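As with other Transformers configurations, a CsmConfig can also be built from scratch and used to instantiate a randomly initialized model. A minimal sketch (the overridden values below are illustrative only, not recommended settings):
from transformers import CsmConfig, CsmDepthDecoderConfig, CsmForConditionalGeneration

# default values match the csm-1b architecture; override only what you need
depth_decoder_config = CsmDepthDecoderConfig(num_hidden_layers=2)
config = CsmConfig(depth_decoder_config=depth_decoder_config, num_hidden_layers=8)

# instantiate a model with random weights from this configuration
model = CsmForConditionalGeneration(config)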
CsmDepthDecoderConfig
class transformers.CsmDepthDecoderConfig
< source >( num_codebooks = 32 backbone_hidden_size = 2048 vocab_size = 2051 hidden_size = 1024 intermediate_size = 8192 num_hidden_layers = 4 num_attention_heads = 8 num_key_value_heads = 2 hidden_act = 'silu' max_position_embeddings = 33 initializer_range = 0.02 rms_norm_eps = 1e-05 use_cache = True pad_token_id = None bos_token_id = None eos_token_id = None rope_theta = 500000 rope_scaling = None attention_bias = False attention_dropout = 0.0 mlp_bias = False head_dim = None **kwargs )
Parameters
- num_codebooks (
int
, optional, defaults to 32) — Number of codebooks used in the underlying codec model responsible for tokenizing the audio. - backbone_hidden_size (
int
, optional, defaults to 2048) — Dimension of the hidden representations of the backbone model used with this depth decoder. - vocab_size (
int
, optional, defaults to 2051) — Vocabulary size of the CsmDepthDecoder model. Defines the number of different audio tokens that can be represented by each codebook. - hidden_size (
int
, optional, defaults to 1024) — Dimension of the hidden representations. - intermediate_size (
int
, optional, defaults to 8192) — Dimension of the MLP representations. - num_hidden_layers (
int
, optional, defaults to 4) — Number of hidden layers in the Transformer decoder. - num_attention_heads (
int
, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer decoder. - num_key_value_heads (
int
, optional, defaults to 2) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, it will default to num_attention_heads
. - hidden_act (
str
orfunction
, optional, defaults to"silu"
) — The non-linear activation function (function or string) in the decoder. - max_position_embeddings (
int
, optional, defaults to 33) — The maximum sequence length that this model might ever be used with. - initializer_range (
float
, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - rms_norm_eps (
float
, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers. - use_cache (
bool
, optional, defaults toTrue
) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant ifconfig.is_decoder=True
. - pad_token_id (
int
, optional, defaults to 2050) — Padding token id. - bos_token_id (
int
, optional) — Beginning of stream token id. - eos_token_id (
int
, optional) — End of stream token id. - rope_theta (
float
, optional, defaults to 500000) — The base period of the RoPE embeddings. - rope_scaling (
Dict
, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type and you expect the model to work on longermax_position_embeddings
, we recommend you to update this value accordingly. Expected contents:rope_type
(str
): The sub-variant of RoPE to use. Can be one of [‘default’, ‘linear’, ‘dynamic’, ‘yarn’, ‘longrope’, ‘llama3’], with ‘default’ being the original RoPE implementation.factor
(float
, optional): Used with all rope types except ‘default’. The scaling factor to apply to the RoPE embeddings. In most scaling types, afactor
of x will enable the model to handle sequences of length x original maximum pre-trained length.original_max_position_embeddings
(int
, optional): Used with ‘dynamic’, ‘longrope’ and ‘llama3’. The original max position embeddings used during pretraining.attention_factor
(float
, optional): Used with ‘yarn’ and ‘longrope’. The scaling factor to be applied on the attention computation. If unspecified, it defaults to value recommended by the implementation, using thefactor
field to infer the suggested value.beta_fast
(float
, optional): Only used with ‘yarn’. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.beta_slow
(float
, optional): Only used with ‘yarn’. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.short_factor
(List[float]
, optional): Only used with ‘longrope’. The scaling factor to be applied to short contexts (<original_max_position_embeddings
). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2long_factor
(List[float]
, optional): Only used with ‘longrope’. The scaling factor to be applied to long contexts (<original_max_position_embeddings
). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2low_freq_factor
(float
, optional): Only used with ‘llama3’. Scaling factor applied to low frequency components of the RoPEhigh_freq_factor
(float
, optional*): Only used with ‘llama3’. Scaling factor applied to high frequency components of the RoPE - attention_bias (
bool
, optional, defaults toFalse
) — Whether to use a bias in the query, key, value and output projection layers during self-attention. - attention_dropout (
float
, optional, defaults to 0.0) — The dropout ratio for the attention probabilities. - mlp_bias (
bool
, optional, defaults toFalse
) — Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers. - head_dim (
int
, optional) — The attention head dimension. If None, it will default to hidden_size // num_attention_heads
This is the configuration class to store the configuration of a CsmDepthDecoderModel. It is used to instantiate a CSM depth decoder model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of csm-1b.
e.g. eustlb/csm-1b
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
CsmProcessor
class transformers.CsmProcessor
< source >( feature_extractor tokenizer chat_template = None )
Parameters
- feature_extractor (EncodecFeatureExtractor) — The feature extractor is a required input.
- tokenizer ([
PreTrainedTokenizer
,PreTrainedTokenizerFast
]) — The tokenizer is a required input. - chat_template (
str
, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
Constructs a Csm processor which wraps EncodecFeatureExtractor and PretrainedTokenizerFast into a single processor that inherits both the audio feature extraction and the tokenizer functionalities. See call() for more information.
The preferred way of passing kwargs is as a dictionary per modality, see usage example below.
from transformers import CsmProcessor
from datasets import load_dataset
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
audio = ds[0]["audio"]["array"]
processor = CsmProcessor.from_pretrained("eustlb/csm-1b")
processor(
text=["<|begin_of_text|>[0]What are you working on?<|end_of_text|><|AUDIO|><|audio_eos|><|begin_of_text|>[1]I'm figuring out my budget.<|end_of_text|>"],
audio=audio,
text_kwargs = {"padding": False},
audio_kwargs = {"sampling_rate": 16000},
common_kwargs = {"return_tensors": "pt"},
)
# this should error out because EncodecFeatureExtractor expects a 24kHz audio :)
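A hedged follow-up sketch of the corrected call: resample the dataset audio to 24kHz first (as in the usage examples above) so that the feature extractor accepts it.
from datasets import Audio

# resample to the 24kHz expected by EncodecFeatureExtractor
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
audio = ds[0]["audio"]["array"]

processor(
    text=["<|begin_of_text|>[0]What are you working on?<|end_of_text|><|AUDIO|><|audio_eos|><|begin_of_text|>[1]I'm figuring out my budget.<|end_of_text|>"],
    audio=audio,
    text_kwargs={"padding": False},
    audio_kwargs={"sampling_rate": 24000},
    common_kwargs={"return_tensors": "pt"},
)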
__call__
< source >( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[numpy.ndarray], typing.Tuple[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')], typing.Tuple[ForwardRef('torch.Tensor')], NoneType] = None output_labels: typing.Optional[bool] = False depth_decoder_labels_ratio: typing.Optional[float] = 1.0 **kwargs: typing_extensions.Unpack[transformers.models.csm.processing_csm.CsmProcessorKwargs] ) → BatchFeature
Parameters
- audio (
np.ndarray
,torch.Tensor
,List[np.ndarray]
,List[torch.Tensor]
) — The audio or batch of audio to be prepared. Each audio can be a NumPy array or PyTorch tensor. - text (
str
,List[str]
,List[List[str]]
) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True
(to lift the ambiguity with a batch of sequences). - output_labels (bool, optional, default=False) —
Whether to return labels for training. Indices will be in [config.audio_token_id, -100, -101]. config.audio_token_id indicates an audio frame (considering sequence length elements as frames), -100 will be ignored in the loss computation, and -101 indicates the audio frame will be used only for the backbone model (using the first codebook token as labels).
- depth_decoder_labels_ratio (float, optional, default=1.0) — The ratio of audio frames to keep for the depth decoder labels.
- return_tensors (
str
or TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are:'tf'
: Return TensorFlowtf.constant
objects.'pt'
: Return PyTorchtorch.Tensor
objects.'np'
: Return NumPynp.ndarray
objects.'jax'
: Return JAXjnp.ndarray
objects.
Returns
A BatchFeature with the following fields:
- input_ids — List of token ids to be fed to a model. Returned when
text
is notNone
. - input_values — List of audio values to be fed to a model. Returned when
audio
is notNone
. - attention_mask — List of indices specifying which tokens should be attended to by the model (when
return_attention_mask=True
or if “attention_mask” is inself.model_input_names
and iftext
is notNone
). - labels — List of labels for the audio frames. Returned when
output_labels=True
.
Main method to prepare text(s) and audio to be fed as input to the model. This method forwards the text
arguments to PreTrainedTokenizerFast’s call() to encode
the text. To prepare the audio, this method forwards the audio
arguments to
EncodecFeatureExtractor’s call(). Please refer
to the docstring of the above two methods for more information.
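For training, labels can be requested directly from this call. A hedged sketch, reusing processor and the 24kHz-resampled ds from the example above and following the per-modality kwargs convention shown earlier:
batch = processor(
    text=["<|begin_of_text|>[0]What are you working on?<|end_of_text|><|AUDIO|><|audio_eos|>"],
    audio=ds[0]["audio"]["array"],
    output_labels=True,
    depth_decoder_labels_ratio=0.5,  # keep only half of the audio frames for the depth decoder labels
    common_kwargs={"return_tensors": "pt"},
)
print(batch["labels"].shape)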
CsmForConditionalGeneration
class transformers.CsmForConditionalGeneration
< source >( config )
Parameters
- config (CsmConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Csm model consists of two llama-like auto-regressive transformer models: a backbone model that predicts the first codebook token and a depth decoder that predicts the other codebook tokens.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None input_values: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None input_values_cutoffs: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.models.csm.modeling_csm.KwargsForCausalLM] ) → transformers.models.csm.modeling_csm.CsmOutputWithPast
or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length, num_codebooks) or (batch_size, sequence_length)
) —-
(batch_size, sequence_length): corresponds to the input sequence prepared with the processor from the text prompt. Such input requires
input_values
to be provided so that audio can be encoded in codebook tokens and then merged with the text tokens. -
(batch_size, sequence_length, num_codebooks): codebook tokens generated during the autoregressive decoding. Such input is not meant to be used by end users.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
-
- attention_mask (
torch.Tensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If
past_key_values
is used, optionally only the lastinput_ids
have to be input (seepast_key_values
).If you want to change padding behavior, you should read
modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- position_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]
. - past_key_values (
Cache
ortuple(tuple(torch.FloatTensor))
, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_values
returned by the model at a previous stage of decoding, whenuse_cache=True
orconfig.use_cache=True
.Two formats are allowed:
- a Cache instance, see our kv cache guide;
- Tuple of
tuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no
past_key_values
are passed, the legacy cache format will be returned.If
past_key_values
are used, the user can optionally input only the lastinput_ids
(those that don’t have their past key value states given to this model) of shape(batch_size, 1)
instead of allinput_ids
of shape(batch_size, sequence_length)
. - inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_ids
indices into associated vectors than the model’s internal embedding lookup matrix. - use_cache (
bool
, optional) — If set toTrue
,past_key_values
key value states are returned and can be used to speed up decoding (seepast_key_values
). - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - cache_position (
torch.LongTensor
of shape(sequence_length)
, optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids
, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length. - input_values_cutoffs (torch.Tensor of shape (batch_size, max_num_audio), optional) — Specifies the end positions of audio segments within each batch entry, relative to the concatenated audio input. If a batch entry has fewer segments than the maximum, it is padded with -1. For example, in a batch of 2 sequences where the first contains 2 audio segments of length l1 and the second contains 1 audio segment of length l2, input_values_cutoffs would be [[l1, 2 * l1], [l2, -1]] (see the sketch after this parameter list).
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Labels for computing the masked language modeling loss. Indices should be in [config.audio_token_id, -100, -101]. Requires the targeted input_values to be provided, as audio tokens will be inferred from them using the codec_model. config.audio_token_id indicates an audio frame (considering sequence length elements as frames), -100 will be ignored in the loss computation, and -101 indicates the audio frame will be used only for the backbone model (using the first codebook token as labels).
Such labels can be prepared using
output_labels=True
when calling CsmProcessor. - logits_to_keep (
int
ortorch.Tensor
, optional) — Kept for compatibility. Does not support any value other than: 0, which is equivalent to keeping all logits (used in the training regime), and 1, which is equivalent to keeping only the last logit (used in the generation regime).
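As referenced in the input_values_cutoffs description above, here is a hedged, illustration-only sketch of its layout; in practice this tensor is prepared by CsmProcessor rather than built by hand.
import torch

# batch entry 0: two audio segments of length l1 each, concatenated
# batch entry 1: a single audio segment of length l2, padded with -1
l1, l2 = 24000, 36000  # hypothetical segment lengths in samples
input_values_cutoffs = torch.tensor(
    [
        [l1, 2 * l1],
        [l2, -1],
    ]
)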
Returns
transformers.models.csm.modeling_csm.CsmOutputWithPast
or tuple(torch.FloatTensor)
A transformers.models.csm.modeling_csm.CsmOutputWithPast
or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (CsmConfig) and inputs.
-
loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) — Language modeling loss (for next-token prediction). -
logits (
torch.FloatTensor
of shape(batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -
past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) — Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
)Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding. -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) — Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) — Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
-
depth_decoder_loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) — Language modeling loss (for next-token prediction) of the depth decoder model. -
depth_decoder_logits (
torch.FloatTensor
of shape(batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the depth decoder (scores for each vocabulary token before SoftMax). -
depth_decoder_past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) — Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
) -
depth_decoder_hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) — Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
depth_decoder_attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) — Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
. -
backbone_loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) — Language modeling loss (for next-token prediction) of the backbone model.
The CsmForConditionalGeneration forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> import torch
>>> from transformers import CsmForConditionalGeneration, AutoProcessor
>>> from datasets import load_dataset, Audio
>>> model_id = "eustlb/csm-1b"
>>> torch_device = "cuda" if torch.cuda.is_available() else "cpu"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
>>> # ensure the audio is 24kHz
>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
>>> conversation = []
>>> # prepare a conversation with text and corresponding audio
>>> for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
...     conversation.append(
...         {
...             "role": f"{speaker_id}",
...             "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
...         }
...     )
>>> inputs = processor.apply_chat_template(
...     conversation,
...     tokenize=True,
...     return_dict=True,
...     output_labels=True,
... ).to(torch_device)
>>> model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)
>>> output = model(**inputs)
>>> output.loss.backward()
generate
< source >( input_ids: typing.Optional[torch.Tensor] = None input_values: typing.Optional[torch.Tensor] = None input_values_cutoffs: typing.Optional[torch.Tensor] = None generation_config: typing.Optional[transformers.generation.configuration_utils.GenerationConfig] = None logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None synced_gpus: typing.Optional[bool] = None streamer: typing.Optional[ForwardRef('BaseStreamer')] = None output_audio: typing.Optional[bool] = False **kwargs ) → CsmGenerateOutput
or torch.LongTensor
or List[torch.FloatTensor]
Parameters
- input_ids (
torch.Tensor
of shape (batch_size, seq_length), optional) — The sequence used as a prompt for the backbone model. - input_values (
torch.Tensor
of shape (batch_size, channels, max_concatenated_audio_length), optional) — The batched audio input values, where each batch entry contains the concatenation of all audio segments for that entry. These values will be encoded into codebook tokens using the codec model and merged with the text input ids provided ininput_ids
. - input_values_cutoffs (
torch.Tensor
of shape (batch_size, max_num_audio), optional) — Specify the end positions of audio segments within each batch entry, relative to the concatenated audio input. If a batch entry has fewer segments than the maximum, it is padded with -1. For example, in a batch of 2 sequences where the first contains 2 audio segments of length l1, and the second contains 1 audio segment of length l2, the input_values_cutoffs would be: [[l1, 2 * l1], [l2, -1]]. - generation_config (GenerationConfig, optional) —
The generation configuration to be used as base parametrization for the generation call.
**kwargs
passed to generate matching the attributes ofgeneration_config
will override them. Ifgeneration_config
is not provided, the default will be used, which has the following loading priority: 1) from thegeneration_config.json
model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation. - logits_processor (
LogitsProcessorList
, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. - stopping_criteria (
StoppingCriteriaList
, optional) — Custom stopping criteria that complements the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. If your stopping criteria depends on thescores
input, make sure you passreturn_dict_in_generate=True, output_scores=True
togenerate
. This feature is intended for advanced users. - synced_gpus (
bool
, optional) — Whether to continue running the while loop until max_length. Unless overridden, this flag will be set toTrue
if usingFullyShardedDataParallel
or DeepSpeed ZeRO Stage 3 with multiple GPUs to avoid deadlocking if one GPU finishes generating before other GPUs. Otherwise, defaults toFalse
. - streamer (
BaseStreamer
, optional) — Streamer object that will be used to stream the generated sequences. Generated tokens are passed throughstreamer.put(token_ids)
and the streamer is responsible for any further processing. - output_audio (
bool
, optional) — Whether to return the generated audio. - kwargs (
Dict[str, Any]
, optional) — Ad hoc parametrization ofgeneration_config
and/or additional model-specific kwargs that will be forwarded to theforward
function of the model. Depth decoder specific kwargs should be prefixed with depth_decoder_.
Returns
CsmGenerateOutput
or torch.LongTensor
or List[torch.FloatTensor]
A CsmGenerateOutput
(if return_dict_in_generate=True
or when config.return_dict_in_generate=True
) or a torch.LongTensor
when output_audio=False
or a List[torch.FloatTensor]
otherwise.
This method overrides generate() to match the specifics of the CSM model, which requires a custom generation sampling loop:
- Infer the backbone model to sample the first codebook token
- Call generate on the depth decoder with the first codebook token as
input_ids
to sample the next codebook tokens - Use these generated codebook tokens as
input_ids
to sample the next first codebook token using the backbone model - Repeat until stopping criteria is met
Most generation-controlling parameters are set in generation_config
which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config
by passing the corresponding
parameters to generate(), e.g. .generate(inputs, do_sample=True)
.
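Note also the return types listed above: when output_audio=False the method returns generated codebook token ids, and when output_audio=True it returns decoded waveforms. A short hedged sketch, reusing model, inputs, and processor from the usage examples:
# codebook token ids (torch.LongTensor)
generated_ids = model.generate(**inputs)

# decoded waveforms (list of torch.FloatTensor), ready to be saved
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "generated.wav")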
Example:
>>> import torch
>>> from transformers import CsmForConditionalGeneration, AutoProcessor
>>> from datasets import load_dataset, Audio
>>> model_id = "eustlb/csm-1b"
>>> torch_device = "cuda" if torch.cuda.is_available() else "cpu"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
>>> # ensure the audio is 24kHz
>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
>>> conversation = []
>>> # prepare a conversation with text and corresponding audio
>>> for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
...     conversation.append(
...         {
...             "role": f"{speaker_id}",
...             "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
...         }
...     )
>>> # text prompt
>>> conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})
>>> inputs = processor.apply_chat_template(
...     conversation,
...     tokenize=True,
...     return_dict=True,
... ).to(torch_device)
>>> model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)
>>> audio = model.generate(**inputs, output_audio=True)
>>> processor.save_audio(audio, "output.wav")
CsmDepthDecoderForCausalLM
class transformers.CsmDepthDecoderForCausalLM
< source >( config )
Parameters
- config (CsmDepthDecoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The CsmDepthDecoder Model transformer, with a CsmCodebooksHead on top, which can be seen as a position-specific language modeling head: a different linear layer is used for each codebook position (e.g. position 0 is the first codebook and uses the first codebook head, etc.). A conceptual sketch is given below.
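The CsmCodebooksHead implementation is internal to Transformers, but the idea of a position-specific head can be sketched as one linear projection per codebook position. This is an illustration only, not the actual implementation, and the class name below is hypothetical:
import torch
from torch import nn

class PositionSpecificHead(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, hidden_size: int, vocab_size: int, num_codebooks: int):
        super().__init__()
        # one linear layer per codebook position
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_codebooks)]
        )

    def forward(self, hidden_states: torch.Tensor, codebook_position: int) -> torch.Tensor:
        # position 0 uses the first codebook head, position 1 the second, etc.
        return self.heads[codebook_position](hidden_states)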
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None backbone_last_hidden_state: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.models.csm.modeling_csm.KwargsForCausalLM] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (
torch.Tensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If
past_key_values
is used, optionally only the lastinput_ids
have to be input (seepast_key_values
).If you want to change padding behavior, you should read
modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- position_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]
. - past_key_values (
Cache
ortuple(tuple(torch.FloatTensor))
, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_values
returned by the model at a previous stage of decoding, whenuse_cache=True
orconfig.use_cache=True
.Two formats are allowed:
- a Cache instance, see our kv cache guide;
- Tuple of
tuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy cache format.
The model will output the same cache format that is fed as input. If no
past_key_values
are passed, the legacy cache format will be returned.If
past_key_values
are used, the user can optionally input only the lastinput_ids
(those that don’t have their past key value states given to this model) of shape(batch_size, 1)
instead of allinput_ids
of shape(batch_size, sequence_length)
. - inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_ids
indices into associated vectors than the model’s internal embedding lookup matrix. - use_cache (
bool
, optional) — If set toTrue
,past_key_values
key value states are returned and can be used to speed up decoding (seepast_key_values
). - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - cache_position (
torch.LongTensor
of shape(sequence_length)
, optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids
, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length. - labels (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]
or -100 (seeinput_ids
docstring). Tokens with indices set to-100
are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size]
. - logits_to_keep (
int
ortorch.Tensor
, optional) — If anint
, compute logits for the lastlogits_to_keep
tokens. If0
, calculate logits for allinput_ids
(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If atorch.Tensor
, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (CsmConfig) and inputs.
-
loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) — Language modeling loss (for next-token prediction). -
logits (
torch.FloatTensor
of shape(batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -
past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) — Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
)Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding. -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) — Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) — Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The CsmDepthDecoderForCausalLM forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Note: the depth decoder is not meant to be used as a standalone text language model. It operates on codebook tokens and, when the first codebook token is provided, requires the backbone model's last hidden state (see the backbone_last_hidden_state argument in the forward signature). In practice it is driven internally by CsmForConditionalGeneration.generate(), following the custom sampling loop described in the generate() section above.
CsmDepthDecoderModel
class transformers.CsmDepthDecoderModel
< source >( config )
Parameters
- config (CsmDepthDecoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare CsmDepthDecoderModel outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a CsmDecoderLayer
forward
< source >( input_ids: LongTensor = None backbone_last_hidden_state: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **flash_attn_kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change the padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the BART paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Two formats are allowed:
  - a Cache instance, see our kv cache guide;
  - a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
  The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- backbone_last_hidden_state (torch.FloatTensor of shape (batch_size, backbone_hidden_size), optional) — The last hidden state of the backbone model. This input is required when the first codebook token (the one generated by the backbone model) is provided in the input_ids argument.
The CsmDepthDecoderModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
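As a rough illustration of the signature above, the sketch below runs a single forward step on a randomly initialized model: position 0 carries the first codebook token, so backbone_last_hidden_state must be supplied, as documented. The zero-valued placeholder tensors and the config.backbone_hidden_size attribute are assumptions, so shapes may need adapting to your configuration:
>>> import torch
>>> from transformers import CsmDepthDecoderConfig, CsmDepthDecoderModel
>>> config = CsmDepthDecoderConfig()
>>> model = CsmDepthDecoderModel(config)
>>> input_ids = torch.zeros(1, 1, dtype=torch.long)  # first codebook token, as produced by the backbone (placeholder value)
>>> backbone_state = torch.zeros(1, config.backbone_hidden_size)  # stands in for the backbone's last hidden state
>>> outputs = model(input_ids=input_ids, backbone_last_hidden_state=backbone_state)
>>> hidden = outputs.last_hidden_state  # raw hidden states of shape (1, 1, config.hidden_size)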
CsmBackboneModel
class transformers.CsmBackboneModel
< source >( config )
Parameters
- config (CsmConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare CsmBackboneModel outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a CsmDecoderLayer.
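In practice the backbone is owned and called by CsmForConditionalGeneration. A minimal sketch of reaching it from the pretrained composite checkpoint (the backbone_model attribute name is an assumption here):
>>> from transformers import CsmForConditionalGeneration
>>> model = CsmForConditionalGeneration.from_pretrained("eustlb/csm-1b")
>>> backbone = model.backbone_model  # assumed attribute exposing the CsmBackboneModel
>>> num_layers = backbone.config.num_hidden_layers  # number of CsmDecoderLayer blocks, as described above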
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None **flash_attn_kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length, num_codebooks) or (batch_size, sequence_length)) —
  - (batch_size, sequence_length): corresponds to the input sequence prepared with the processor from the text prompt. Such input requires input_values to be provided so that audio can be encoded into codebook tokens and then merged with the text tokens.
  - (batch_size, sequence_length, num_codebooks): codebook tokens generated during autoregressive decoding. Such input is not meant to be used by end users.
  Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change the padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the BART paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (Cache or tuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Two formats are allowed:
  - a Cache instance, see our kv cache guide;
  - a tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
  The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
The CsmBackboneModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
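Since the codebook-shaped input_ids described above are produced internally during decoding, a safer way to exercise the bare forward signature is through inputs_embeds. A minimal sketch on a randomly initialized backbone (the default values of CsmConfig are assumptions here, and the default-sized model is large, so this is illustrative rather than practical):
>>> import torch
>>> from transformers import CsmConfig, CsmBackboneModel
>>> config = CsmConfig()  # default, untrained configuration (assumed defaults)
>>> backbone = CsmBackboneModel(config)
>>> embeds = torch.zeros(1, 4, config.hidden_size)  # placeholder embeddings for a 4-step sequence
>>> outputs = backbone(inputs_embeds=embeds)
>>> hidden = outputs.last_hidden_state  # raw hidden states of shape (1, 4, config.hidden_size)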