Yingxu He committed
Commit 385b92d · verified · 1 Parent(s): f4991de

Upload config

Files changed (3)
  1. README.md +199 -0
  2. config.json +60 -0
  3. configuration_meralion.py +518 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!-- fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
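The README's "How to Get Started with the Model" section is still a stub. A minimal sketch of the usual pattern for checkpoints that ship custom configuration code, assuming a placeholder repo id (the actual repository is not named in this commit); `trust_remote_code=True` is needed because `auto_map` in config.json points at configuration_meralion.py:

```python
from transformers import AutoConfig

# "<org>/<meralion-repo>" is a placeholder -- substitute the Hub repository
# this commit belongs to. trust_remote_code lets transformers import the
# custom MERaLiONConfig registered under "auto_map" in config.json.
config = AutoConfig.from_pretrained(
    "<org>/<meralion-repo>",
    trust_remote_code=True,
)
print(type(config).__name__)  # MERaLiONConfig
```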
config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "auto_map": {
+     "AutoConfig": "configuration_meralion.MERaLiONConfig"
+   },
+   "head_dim": 256,
+   "hidden_size": 3584,
+   "intermediate_size": 14336,
+   "model_type": "meralion",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 42,
+   "num_key_value_heads": 8,
+   "sliding_window": 4096,
+   "speech_config": {
+     "_name_or_path": "openai/whisper-large-v3",
+     "apply_spec_augment": true,
+     "architectures": [
+       "WhisperForConditionalGeneration"
+     ],
+     "begin_suppress_tokens": [
+       220,
+       50257
+     ],
+     "bos_token_id": 50257,
+     "d_model": 1280,
+     "decoder_attention_heads": 20,
+     "decoder_ffn_dim": 5120,
+     "decoder_layers": 32,
+     "decoder_start_token_id": 50258,
+     "encoder_attention_heads": 20,
+     "encoder_ffn_dim": 5120,
+     "encoder_layers": 32,
+     "eos_token_id": 50257,
+     "mask_time_length": 20,
+     "max_length": 448,
+     "model_type": "meralion_speech_encoder",
+     "num_hidden_layers": 32,
+     "num_mel_bins": 128,
+     "torch_dtype": "bfloat16",
+     "vocab_size": 51866
+   },
+   "speech_mlp_scale_factor": 15,
+   "speech_token_index": 255999,
+   "text_config": {
+     "_name_or_path": "aisingapore/gemma2-9b-cpt-sea-lionv3-instruct",
+     "architectures": [
+       "Gemma2ForCausalLM"
+     ],
+     "eos_token_id": 107,
+     "hidden_act": "gelu_pytorch_tanh",
+     "hidden_size": 3584,
+     "intermediate_size": 14336,
+     "model_type": "meralion_text_decoder",
+     "num_hidden_layers": 42,
+     "num_key_value_heads": 8,
+     "query_pre_attn_scalar": 256,
+     "sliding_window_size": 4096,
+     "torch_dtype": "bfloat16"
+   },
+   "transformers_version": "4.44.2"
+ }
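One derived quantity worth relating: the whisper-large-v3 encoder emits 1500 frames per 30-second window (`max_source_positions` in configuration_meralion.py below), and `speech_mlp_scale_factor` is 15. If, as the name suggests, the adapter groups encoder frames by that factor (an assumption, not stated in this commit), each 30-second clip yields 100 speech tokens:

```python
# Assumption: the speech adapter groups encoder frames by speech_mlp_scale_factor.
encoder_frames = 1500          # whisper-large-v3 encoder frames per 30 s window
speech_mlp_scale_factor = 15   # from config.json above
print(encoder_frames // speech_mlp_scale_factor)  # -> 100 speech tokens
```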
configuration_meralion.py ADDED
@@ -0,0 +1,518 @@
+ # coding=utf-8
+ # Copyright 2024 Microsoft Research & University of Wisconsin-Madison and the HuggingFace Inc. team. All rights reserved.
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """MERaLiON model configuration"""
+
+ from collections import OrderedDict
+ from typing import TYPE_CHECKING, Any, Mapping, Optional, Union
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.onnx import OnnxConfig, OnnxSeq2SeqConfigWithPast
+ from transformers.utils import logging
+
+
+ if TYPE_CHECKING:
+     from transformers.feature_extraction_utils import FeatureExtractionMixin
+     from transformers.tokenization_utils_base import PreTrainedTokenizerBase
+     from transformers.utils import TensorType
+
+
+ logger = logging.get_logger(__name__)
+
+
+ # fmt: off
+ NON_SPEECH_TOKENS = [
+     1, 2, 7, 8, 9, 10, 14, 25,
+     26, 27, 28, 29, 31, 58, 59, 60, 61, 62,
+     63, 90, 91, 92, 93, 357, 366, 438, 532, 685,
+     705, 796, 930, 1058, 1220, 1267, 1279, 1303, 1343, 1377,
+     1391, 1635, 1782, 1875, 2162, 2361, 2488, 3467, 4008, 4211,
+     4600, 4808, 5299, 5855, 6329, 7203, 9609, 9959, 10563, 10786,
+     11420, 11709, 11907, 13163, 13697, 13700, 14808, 15306, 16410, 16791,
+     17992, 19203, 19510, 20724, 22305, 22935, 27007, 30109, 30420, 33409,
+     34949, 40283, 40493, 40549, 47282, 49146, 50257, 50359, 50360, 50361
+ ]
+ NON_SPEECH_TOKENS_MULTI = [
+     1, 2, 7, 8, 9, 10, 14, 25,
+     26, 27, 28, 29, 31, 58, 59, 60, 61, 62,
+     63, 90, 91, 92, 93, 359, 503, 522, 542, 873,
+     893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627,
+     3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647,
+     7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793,
+     14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675,
+     22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865,
+     42863, 47425, 49870, 50254, 50258, 50360, 50361, 50362
+ ]
+ # fmt: on
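These two id lists feed the `suppress_tokens` argument documented in the class below. As a rough sketch of what that suppression amounts to downstream (an assumption based on Whisper's generation pipeline, not spelled out in this file), the listed ids are masked out of the logits before sampling:

```python
import math

# Hypothetical helper illustrating token suppression: ids in suppress_ids
# can never be sampled because their logits are forced to -inf.
def suppress(logits: list[float], suppress_ids: list[int]) -> list[float]:
    out = list(logits)
    for token_id in suppress_ids:
        if token_id < len(out):
            out[token_id] = -math.inf
    return out
```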
+
+ # Copied from transformers.models.whisper.configuration_whisper.WhisperConfig
+ class MERaLiONSpeechConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`MERaLiONSpeechModel`]. It is used to instantiate a
+     MERaLiONSpeech model according to the specified arguments, defining the model architecture. Instantiating a configuration
+     with the defaults will yield a similar configuration to that of the MERaLiONSpeech
+     [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) architecture.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 51865):
+             Vocabulary size of the MERaLiONSpeech model. Defines the number of different tokens that can be represented by the
+             `decoder_input_ids` passed when calling [`MERaLiONSpeechModel`].
+         num_mel_bins (`int`, *optional*, defaults to 80):
+             Number of mel features used per input feature. Should correspond to the value used in the
+             `MERaLiONSpeechProcessor` class.
+         encoder_layers (`int`, *optional*, defaults to 4):
+             Number of encoder layers.
+         decoder_layers (`int`, *optional*, defaults to 4):
+             Number of decoder layers.
+         encoder_attention_heads (`int`, *optional*, defaults to 6):
+             Number of attention heads for each attention layer in the Transformer encoder.
+         decoder_attention_heads (`int`, *optional*, defaults to 6):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         encoder_ffn_dim (`int`, *optional*, defaults to 1536):
+             Dimensionality of the "intermediate" (often named feed-forward) layer in the encoder.
+         decoder_ffn_dim (`int`, *optional*, defaults to 1536):
+             Dimensionality of the "intermediate" (often named feed-forward) layer in the decoder.
+         encoder_layerdrop (`float`, *optional*, defaults to 0.0):
+             The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
+             for more details.
+         decoder_layerdrop (`float`, *optional*, defaults to 0.0):
+             The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
+             for more details.
+         decoder_start_token_id (`int`, *optional*, defaults to 50257):
+             Corresponds to the "<|startoftranscript|>" token, which is automatically used when no `decoder_input_ids`
+             are provided to the `generate` function. It is used to guide the model's generation process depending on
+             the task.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models).
+         is_encoder_decoder (`bool`, *optional*, defaults to `True`):
+             Whether the model is used as an encoder/decoder or not.
+         activation_function (`str`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+             `"relu"`, `"silu"` and `"gelu_new"` are supported.
+         d_model (`int`, *optional*, defaults to 384):
+             Dimensionality of the layers.
+         dropout (`float`, *optional*, defaults to 0.1):
+             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         activation_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for activations inside the fully connected layer.
+         init_std (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         scale_embedding (`bool`, *optional*, defaults to `False`):
+             Scale embeddings by dividing by sqrt(d_model).
+         max_source_positions (`int`, *optional*, defaults to 1500):
+             The maximum sequence length of log-mel filter-bank features that this model might ever be used with.
+         max_target_positions (`int`, *optional*, defaults to 448):
+             The maximum sequence length that this model might ever be used with. Typically set this to something large
+             just in case (e.g., 512 or 1024 or 2048).
+         pad_token_id (`int`, *optional*, defaults to 50256):
+             Padding token id.
+         bos_token_id (`int`, *optional*, defaults to 50256):
+             Begin of stream token id.
+         eos_token_id (`int`, *optional*, defaults to 50256):
+             End of stream token id.
+         suppress_tokens (`List[int]`, *optional*):
+             A list containing the non-speech tokens that will be used by the logit processor in the `generate`
+             function. NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI correspond to the `english-only` and the
+             `multilingual` model, respectively.
+         begin_suppress_tokens (`List[int]`, *optional*, defaults to `[220, 50256]`):
+             A list containing tokens that will be suppressed at the beginning of the sampling process. Initialized as
+             the token for `" "` (`blank_token_id`) and the `eos_token_id`.
+         use_weighted_layer_sum (`bool`, *optional*, defaults to `False`):
+             Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an
+             instance of [`MERaLiONSpeechForAudioClassification`].
+         classifier_proj_size (`int`, *optional*, defaults to 256):
+             Dimensionality of the projection before token mean-pooling for classification. Only relevant when using an
+             instance of [`MERaLiONSpeechForAudioClassification`].
+         apply_spec_augment (`bool`, *optional*, defaults to `False`):
+             Whether to apply *SpecAugment* data augmentation to the outputs of the feature encoder. For reference see
+             [SpecAugment: A Simple Data Augmentation Method for Automatic Speech
+             Recognition](https://arxiv.org/abs/1904.08779).
+         mask_time_prob (`float`, *optional*, defaults to 0.05):
+             Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking
+             procedure generates `mask_time_prob*len(time_axis)/mask_time_length` independent masks over the axis. If
+             reasoning from the probability of each feature vector to be chosen as the start of the vector span to be
+             masked, *mask_time_prob* should be `prob_vector_start*mask_time_length`. Note that overlap may decrease the
+             actual percentage of masked vectors. This is only relevant if `apply_spec_augment == True`.
+         mask_time_length (`int`, *optional*, defaults to 10):
+             Length of vector span along the time axis.
+         mask_time_min_masks (`int`, *optional*, defaults to 2):
+             The minimum number of masks of length `mask_time_length` generated along the time axis, each time step,
+             irrespective of `mask_time_prob`. Only relevant if `mask_time_prob*len(time_axis)/mask_time_length <
+             mask_time_min_masks`.
+         mask_feature_prob (`float`, *optional*, defaults to 0.0):
+             Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The
+             masking procedure generates `mask_feature_prob*len(feature_axis)/mask_feature_length` independent masks over
+             the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector
+             span to be masked, *mask_feature_prob* should be `prob_vector_start*mask_feature_length`. Note that overlap
+             may decrease the actual percentage of masked vectors. This is only relevant if `apply_spec_augment is
+             True`.
+         mask_feature_length (`int`, *optional*, defaults to 10):
+             Length of vector span along the feature axis.
+         mask_feature_min_masks (`int`, *optional*, defaults to 0):
+             The minimum number of masks of length `mask_feature_length` generated along the feature axis, each time
+             step, irrespective of `mask_feature_prob`. Only relevant if
+             `mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks`.
+         median_filter_width (`int`, *optional*, defaults to 7):
+             Width of the median filter used to smooth the cross-attention outputs when computing token timestamps.
+             Should be an odd number.
+     """
+
+     model_type = "meralion_speech_encoder"
+     keys_to_ignore_at_inference = ["past_key_values"]
+     attribute_map = {
+         "num_key_value_heads": "encoder_attention_heads",
+         "num_attention_heads": "encoder_attention_heads",
+         "hidden_size": "d_model",
+     }
+
+     def __init__(
+         self,
+         vocab_size=51865,
+         num_mel_bins=80,
+         encoder_layers=4,
+         encoder_attention_heads=6,
+         decoder_layers=4,
+         decoder_attention_heads=6,
+         decoder_ffn_dim=1536,
+         encoder_ffn_dim=1536,
+         encoder_layerdrop=0.0,
+         decoder_layerdrop=0.0,
+         decoder_start_token_id=50257,
+         use_cache=True,
+         is_encoder_decoder=True,
+         activation_function="gelu",
+         d_model=384,
+         dropout=0.0,
+         attention_dropout=0.0,
+         activation_dropout=0.0,
+         init_std=0.02,
+         scale_embedding=False,
+         max_source_positions=1500,
+         max_target_positions=448,
+         pad_token_id=50256,
+         bos_token_id=50256,
+         eos_token_id=50256,
+         suppress_tokens=None,
+         begin_suppress_tokens=[220, 50256],
+         use_weighted_layer_sum=False,
+         classifier_proj_size=256,
+         apply_spec_augment=False,
+         mask_time_prob=0.05,
+         mask_time_length=10,
+         mask_time_min_masks=2,
+         mask_feature_prob=0.0,
+         mask_feature_length=10,
+         mask_feature_min_masks=0,
+         median_filter_width=7,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.num_mel_bins = num_mel_bins
+         self.d_model = d_model
+         self.encoder_layers = encoder_layers
+         self.encoder_attention_heads = encoder_attention_heads
+         self.decoder_layers = decoder_layers
+         self.decoder_attention_heads = decoder_attention_heads
+         self.decoder_ffn_dim = decoder_ffn_dim
+         self.encoder_ffn_dim = encoder_ffn_dim
+         self.dropout = dropout
+         self.attention_dropout = attention_dropout
+         self.activation_dropout = activation_dropout
+         self.activation_function = activation_function
+         self.init_std = init_std
+         self.encoder_layerdrop = encoder_layerdrop
+         self.decoder_layerdrop = decoder_layerdrop
+         self.use_cache = use_cache
+         self.num_hidden_layers = encoder_layers
+         self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True
+         self.max_source_positions = max_source_positions
+         self.max_target_positions = max_target_positions
+
+         # Audio Classification-specific parameters. Feel free to ignore for other classes.
+         self.classifier_proj_size = classifier_proj_size
+         self.use_weighted_layer_sum = use_weighted_layer_sum
+
+         # fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
+         self.apply_spec_augment = apply_spec_augment
+         self.mask_time_prob = mask_time_prob
+         self.mask_time_length = mask_time_length
+         self.mask_time_min_masks = mask_time_min_masks
+         self.mask_feature_prob = mask_feature_prob
+         self.mask_feature_length = mask_feature_length
+         self.mask_feature_min_masks = mask_feature_min_masks
+
+         self.median_filter_width = median_filter_width
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             is_encoder_decoder=is_encoder_decoder,
+             decoder_start_token_id=decoder_start_token_id,
+             suppress_tokens=suppress_tokens,
+             begin_suppress_tokens=begin_suppress_tokens,
+             **kwargs,
+         )
+
+
+ # Copied from transformers.models.whisper.configuration_whisper.WhisperOnnxConfig
+ # The three members below rely on `use_past` and `fill_with_past_key_values_`,
+ # which exist only on the ONNX export configs, so they belong on a separate
+ # OnnxSeq2SeqConfigWithPast subclass rather than on MERaLiONSpeechConfig.
+ class MERaLiONSpeechOnnxConfig(OnnxSeq2SeqConfigWithPast):
+     @property
+     def inputs(self) -> Mapping[str, Mapping[int, str]]:
+         common_inputs = OrderedDict(
+             [
+                 ("input_features", {0: "batch", 1: "feature_size", 2: "encoder_sequence"}),
+             ]
+         )
+         if self.use_past:
+             common_inputs["decoder_input_ids"] = {0: "batch"}
+         else:
+             common_inputs["decoder_input_ids"] = {0: "batch", 1: "decoder_sequence"}
+
+         if self.use_past:
+             self.fill_with_past_key_values_(common_inputs, direction="inputs")
+
+         return common_inputs
+
+     def generate_dummy_inputs(
+         self,
+         preprocessor: Union["PreTrainedTokenizerBase", "FeatureExtractionMixin"],
+         batch_size: int = -1,
+         seq_length: int = -1,
+         is_pair: bool = False,
+         framework: Optional["TensorType"] = None,
+         sampling_rate: int = 22050,
+         time_duration: float = 5.0,
+         frequency: int = 220,
+     ) -> Mapping[str, Any]:
+         dummy_inputs = OrderedDict()
+         encoder_inputs = OnnxConfig.generate_dummy_inputs(
+             self,
+             preprocessor=preprocessor.feature_extractor,
+             batch_size=batch_size,
+             framework=framework,
+             sampling_rate=sampling_rate,
+             time_duration=time_duration,
+             frequency=frequency,
+         )
+         encoder_sequence_length = encoder_inputs["input_features"].shape[2]
+         seq_length = encoder_sequence_length // 2 if self.use_past else seq_length
+
+         decoder_inputs = super().generate_dummy_inputs(
+             preprocessor.tokenizer, batch_size, seq_length, is_pair, framework
+         )
+
+         dummy_inputs["input_features"] = encoder_inputs.pop("input_features")
+         dummy_inputs["decoder_input_ids"] = decoder_inputs.pop("decoder_input_ids")
+
+         if "past_key_values" in decoder_inputs:
+             dummy_inputs["past_key_values"] = decoder_inputs.pop("past_key_values")
+
+         return dummy_inputs
+
+     @property
+     def atol_for_validation(self) -> float:
+         return 1e-3
+
+
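A quick sanity check of the speech config above; this sketch assumes configuration_meralion.py is importable from the working directory and mirrors the whisper-large-v3-scale values in this commit's config.json:

```python
from configuration_meralion import MERaLiONSpeechConfig

# Mirror the "speech_config" block from config.json above.
speech_cfg = MERaLiONSpeechConfig(
    d_model=1280,
    encoder_layers=32,
    encoder_attention_heads=20,
    encoder_ffn_dim=5120,
    num_mel_bins=128,
    vocab_size=51866,
)

# attribute_map aliases hidden_size -> d_model and both head counts
# to encoder_attention_heads.
assert speech_cfg.hidden_size == 1280
assert speech_cfg.num_attention_heads == 20
```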
+ # Copied from transformers.models.gemma2.configuration_gemma2.Gemma2Config
+ class MERaLiONTextConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`MERaLiONTextModel`]. It is used to instantiate a
+     MERaLiONText model according to the specified arguments, defining the model architecture.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 256000):
+             Vocabulary size of the MERaLiONText model. Defines the number of different tokens that can be represented by the
+             `input_ids` passed when calling [`MERaLiONTextModel`].
+         hidden_size (`int`, *optional*, defaults to 3072):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 24576):
+             Dimension of the MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 28):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 16):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         num_key_value_heads (`int`, *optional*, defaults to 16):
+             This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+             `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if
+             `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
+             converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+             by meanpooling all the original heads within that group. For more details, check out [this
+             paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+             `num_attention_heads`.
+         head_dim (`int`, *optional*, defaults to 256):
+             The attention head dimension.
+         hidden_activation (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`):
+             The non-linear activation function (function or string) in the decoder. Will default to `"gelu_pytorch_tanh"`
+             if not specified. `"gelu_pytorch_tanh"` uses an approximation of the `"gelu"` activation function.
+         max_position_embeddings (`int`, *optional*, defaults to 8192):
+             The maximum sequence length that this model might ever be used with.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         pad_token_id (`int`, *optional*, defaults to 0):
+             Padding token id.
+         eos_token_id (`int`, *optional*, defaults to 1):
+             End of stream token id.
+         bos_token_id (`int`, *optional*, defaults to 2):
+             Beginning of stream token id.
+         tie_word_embeddings (`bool`, *optional*, defaults to `True`):
+             Whether to tie weight embeddings.
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         attention_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use a bias in the query, key, value and output projection layers during self-attention.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         query_pre_attn_scalar (`float`, *optional*, defaults to 224):
+             Scaling factor used on the attention scores.
+         sliding_window (`int`, *optional*, defaults to 4096):
+             In MERaLiONText, every other layer uses sliding window attention. This is the size of the sliding window.
+         final_logit_softcapping (`float`, *optional*, defaults to 30.0):
+             Scaling factor when applying tanh softcapping on the logits.
+         attn_logit_softcapping (`float`, *optional*, defaults to 50.0):
+             Scaling factor when applying tanh softcapping on the attention scores.
+         cache_implementation (`str`, *optional*, defaults to `"hybrid"`):
+             The cache type to be used with `generate`.
+     """
+
+     model_type = "meralion_text_decoder"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=256000,
+         hidden_size=3072,
+         intermediate_size=24576,
+         num_hidden_layers=28,
+         num_attention_heads=16,
+         num_key_value_heads=16,
+         head_dim=256,
+         hidden_activation="gelu_pytorch_tanh",
+         max_position_embeddings=8192,
+         initializer_range=0.02,
+         rms_norm_eps=1e-6,
+         use_cache=True,
+         pad_token_id=0,
+         eos_token_id=1,
+         bos_token_id=2,
+         tie_word_embeddings=True,
+         rope_theta=10000.0,
+         attention_bias=False,
+         attention_dropout=0.0,
+         query_pre_attn_scalar=224,
+         sliding_window=4096,
+         final_logit_softcapping=30.0,
+         attn_logit_softcapping=50.0,
+         cache_implementation="hybrid",
+         **kwargs,
+     ):
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.head_dim = head_dim
+         self.num_key_value_heads = num_key_value_heads
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+         self.hidden_activation = hidden_activation
+         self.query_pre_attn_scalar = query_pre_attn_scalar
+         self.sliding_window = sliding_window
+         self.final_logit_softcapping = final_logit_softcapping
+         self.attn_logit_softcapping = attn_logit_softcapping
+         self.cache_implementation = cache_implementation
+
+
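The GQA note in the docstring above maps directly onto the numbers in this commit's config.json; a small arithmetic illustration (values copied from config.json):

```python
# From config.json: 16 query heads share 8 key/value heads, i.e. GQA with
# 2 query heads per KV head (16/16 would be MHA, 16/1 would be MQA).
num_attention_heads = 16
num_key_value_heads = 8
assert num_attention_heads % num_key_value_heads == 0
print(num_attention_heads // num_key_value_heads)  # -> 2
```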
+ class MERaLiONConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`MERaLiONForConditionalGeneration`]. It is used to instantiate a
+     MERaLiON model according to the specified arguments, defining the model architecture. Instantiating a configuration
+     with the defaults will yield a similar configuration to that of MERaLiON.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         speech_config (`Union[MERaLiONSpeechConfig, dict]`, *optional*, defaults to `MERaLiONSpeechConfig`):
+             The config object or dictionary of the speech encoder backbone.
+         text_config (`Union[MERaLiONTextConfig, dict]`, *optional*, defaults to `MERaLiONTextConfig`):
+             The config object or dictionary of the text decoder backbone.
+         speech_mlp_scale_factor (`int`, *optional*, defaults to 15):
+             The downsampling factor applied by the speech MLP adapter when mapping encoder frames to speech tokens.
+         speech_token_index (`int`, *optional*, defaults to 255999):
+             The speech token index used to encode the speech prompt.
+     """
+
+     model_type = "meralion"
+     is_composition = False
+
+     def __init__(
+         self,
+         speech_config=None,
+         text_config=None,
+         speech_mlp_scale_factor=15,
+         speech_token_index=255999,
+         **kwargs,
+     ):
+         if isinstance(speech_config, dict):
+             speech_config = MERaLiONSpeechConfig(**speech_config)
+         elif speech_config is None:
+             speech_config = MERaLiONSpeechConfig(
+                 d_model=1280,
+                 encoder_attention_heads=20,
+                 encoder_ffn_dim=5120,
+                 encoder_layerdrop=0.0,
+                 encoder_layers=32,
+                 num_mel_bins=128,
+                 max_source_positions=1500,
+                 scale_embedding=False,
+                 activation_function="gelu",
+             )
+
+         self.speech_config = speech_config
+
+         if isinstance(text_config, dict):
+             text_config = MERaLiONTextConfig(**text_config)
+         elif text_config is None:
+             text_config = MERaLiONTextConfig()
+
+         self.text_config = text_config
+
+         self.speech_mlp_scale_factor = speech_mlp_scale_factor
+         self.speech_token_index = speech_token_index
+
+         self.sliding_window = self.text_config.sliding_window
+         self.hidden_size = self.text_config.hidden_size
+         self.num_attention_heads = self.text_config.num_attention_heads
+         self.num_hidden_layers = self.text_config.num_hidden_layers
+         self.num_key_value_heads = self.text_config.num_key_value_heads
+         self.head_dim = self.text_config.head_dim
+         self.intermediate_size = self.text_config.intermediate_size
+
+         super().__init__(**kwargs)
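Putting the three files together: loading this commit's config.json through MERaLiONConfig should reproduce the nested structure above. A sketch, assuming config.json and configuration_meralion.py sit in the current working directory:

```python
import json

from configuration_meralion import MERaLiONConfig

# The dict branches in __init__ turn the nested "speech_config" and
# "text_config" blocks into config objects.
with open("config.json") as f:
    cfg = MERaLiONConfig(**json.load(f))

assert cfg.speech_config.d_model == 1280        # whisper-large-v3 encoder width
assert cfg.text_config.hidden_size == 3584      # gemma2-9b decoder width
assert cfg.hidden_size == cfg.text_config.hidden_size  # mirrored onto the top level
```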