KerasHub
Divyasreepat committed
Commit c68c546 · verified · 1 Parent(s): a7c2def

Update README.md with new model card content

Files changed (1):
  1. README.md +265 -28

README.md CHANGED
@@ -1,31 +1,268 @@
  ---
  library_name: keras-hub
  ---
- This is a [`Moonshine` model](https://keras.io/api/keras_hub/models/moonshine) uploaded using the KerasHub library and can be used with JAX, TensorFlow, and PyTorch backends.
- This model is related to a `AudioToText` task.
-
- Model config:
- * **name:** moonshine_backbone_1
- * **trainable:** True
- * **vocabulary_size:** 32768
- * **filter_dim:** 288
- * **encoder_num_layers:** 6
- * **decoder_num_layers:** 6
- * **hidden_dim:** 288
- * **intermediate_dim:** 1152
- * **encoder_num_heads:** 8
- * **decoder_num_heads:** 8
- * **feedforward_expansion_factor:** 4
- * **encoder_use_swiglu_activation:** False
- * **decoder_use_swiglu_activation:** True
- * **max_position_embeddings:** 194
- * **pad_head_dim_to_multiple_of:** None
- * **partial_rotary_factor:** 0.9
- * **dropout:** 0.0
- * **initializer_range:** 0.02
- * **rope_theta:** 10000.0
- * **attention_bias:** False
- * **attention_dropout:** 0.0
- * **dtype:** float32
-
- This model card has been generated automatically and should be completed by the model author. See [Model Cards documentation](https://huggingface.co/docs/hub/model-cards) for more information.
+ ### Model Overview
+
+ The Moonshine models are trained for speech recognition: they transcribe English speech audio into English text. Useful Sensors developed them to support real-time speech transcription products that run on low-cost hardware. Two models of different sizes and capabilities are available; they are summarized in the presets table below.
+
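+ As a quick end-to-end illustration, here is a minimal sketch that mirrors the full Example Usage section further down; the audio path is a placeholder and `librosa` is only one way to obtain 16 kHz mono audio:
+
+ ```python
+ import librosa
+ import tensorflow as tf
+
+ from keras_hub.src.models.moonshine.moonshine_audio_to_text import (
+     MoonshineAudioToText,
+ )
+
+ # Load mono audio at 16 kHz and shape it as (batch, samples, channels).
+ audio, _ = librosa.load("path/to/audio_file.wav", sr=16000, mono=True)
+ audio = tf.convert_to_tensor(audio, dtype=tf.float32)[tf.newaxis, :, tf.newaxis]
+
+ # Load the pretrained tiny English preset and transcribe.
+ model = MoonshineAudioToText.from_preset("moonshine_tiny_en")
+ print(model.generate({"audio": audio}))
+ ```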
+
+ Weights are released under the [MIT License](https://www.mit.edu/~amini/LICENSE.md). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).
+
+ ## Links
+
+ * [Moonshine Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/moonshine-quickstart-notebook)
+ * [Moonshine API Documentation](https://keras.io/keras_hub/api/models/moonshine/)
+ * [Moonshine Paper](https://arxiv.org/abs/2410.15608)
+ * [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
+ * [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)
+
+ ## Installation
+
+ Keras and KerasHub can be installed with:
+
+ ```bash
+ pip install -U -q keras-hub
+ pip install -U -q keras
+ ```
+
+ JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
+
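+ Keras 3 selects its backend from the `KERAS_BACKEND` environment variable, which must be set before Keras is imported; a minimal sketch:
+
+ ```python
+ import os
+
+ # Choose "jax", "tensorflow", or "torch" before importing keras / keras_hub.
+ os.environ["KERAS_BACKEND"] = "jax"
+
+ import keras
+ import keras_hub
+ ```
+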
+ ## Presets
+
+ The following model checkpoints are provided by the Keras team. Full code examples for each are available below.
+
+ | Preset name       | Parameters | Description                                                                                                    |
+ |-------------------|------------|----------------------------------------------------------------------------------------------------------------|
+ | moonshine_base_en | 61.5M      | Moonshine base model for English speech recognition. Developed by Useful Sensors for real-time transcription. |
+ | moonshine_tiny_en | 27.1M      | Moonshine tiny model for English speech recognition. Developed by Useful Sensors for real-time transcription. |
+
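+ Either preset name from the table can be passed to `from_preset()`; a minimal sketch using the same import path as the examples below:
+
+ ```python
+ from keras_hub.src.models.moonshine.moonshine_audio_to_text import (
+     MoonshineAudioToText,
+ )
+
+ # Select a checkpoint by the preset name listed above.
+ tiny = MoonshineAudioToText.from_preset("moonshine_tiny_en")  # 27.1M parameters
+ base = MoonshineAudioToText.from_preset("moonshine_base_en")  # 61.5M parameters
+ ```
+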
+ ## Example Usage
+ ```python
+ import os
+
+ import keras
+ import keras_hub
+ import numpy as np
+ import librosa
+ import tensorflow as tf
+
+ from keras_hub.src.models.moonshine.moonshine_audio_to_text import (
+     MoonshineAudioToText,
+ )
+
+ # Custom backbone.
+ backbone = keras_hub.models.MoonshineBackbone(
+     vocabulary_size=10000,
+     filter_dim=256,
+     encoder_num_layers=6,
+     decoder_num_layers=6,
+     hidden_dim=256,
+     intermediate_dim=512,
+     encoder_num_heads=8,
+     decoder_num_heads=8,
+     feedforward_expansion_factor=4,
+     decoder_use_swiglu_activation=True,
+     encoder_use_swiglu_activation=False,
+ )
+ # Audio features as input (e.g., from MoonshineAudioConverter).
+ outputs = backbone(
+     {
+         "encoder_input_values": np.zeros((1, 16000, 1)),
+         "encoder_padding_mask": np.ones((1, 16000), dtype=bool),
+         "decoder_token_ids": np.zeros((1, 20), dtype=np.int32),
+         "decoder_padding_mask": np.ones((1, 20), dtype=bool),
+     }
+ )
+
+ # Config for test.
+ BATCH_SIZE = 2
+ AUDIO_PATH = "path/to/audio_file.wav"
+
+ # Load and prepare audio data.
+ audio, sr = librosa.load(AUDIO_PATH, sr=16000, mono=True)
+ audio_tensor = tf.expand_dims(audio, axis=-1)
+ audio_tensor = tf.convert_to_tensor(audio_tensor, dtype=tf.float32)
+ single_audio_input_batched = tf.expand_dims(audio_tensor, axis=0)
+ audio_batch = tf.repeat(single_audio_input_batched, BATCH_SIZE, axis=0)
+ dummy_texts = ["Sample transcription.", "Another sample transcription."]
+
+ # Create tf.data.Dataset.
+ audio_ds = tf.data.Dataset.from_tensor_slices(audio_batch)
+ text_ds = tf.data.Dataset.from_tensor_slices(dummy_texts)
+ audio_dataset = (
+     tf.data.Dataset.zip((audio_ds, text_ds))
+     .map(lambda audio, txt: {"audio": audio, "text": txt})
+     .batch(BATCH_SIZE)
+ )
+ print("Audio dataset created.")
+
+ # Load pretrained Moonshine model.
+ audio_to_text = MoonshineAudioToText.from_preset("moonshine_tiny_en")
+
+ # Generation examples.
+ generated_text_single = audio_to_text.generate(
+     {"audio": single_audio_input_batched}
+ )
+ print(f"Generated text (single audio): {generated_text_single}")
+
+ generated_text_batch = audio_to_text.generate({"audio": audio_batch})
+ print(f"Generated text (batch audio): {generated_text_batch}")
+
+ # Compile the generate() function with a custom sampler.
+ audio_to_text.compile(sampler="top_k")
+ generated_text_top_k = audio_to_text.generate(
+     {"audio": single_audio_input_batched}
+ )
+ print(f"Generated text (top_k sampler): {generated_text_top_k}")
+
+ audio_to_text.compile(sampler="greedy")
+ generated_text_greedy = audio_to_text.generate(
+     {"audio": single_audio_input_batched}
+ )
+ print(f"Generated text (greedy sampler): {generated_text_greedy}")
+
+ # Fine-tuning example.
+ audio_to_text.compile(
+     optimizer=keras.optimizers.Adam(learning_rate=1e-5),
+     loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+     weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
+ )
+ history = audio_to_text.fit(audio_dataset, steps_per_epoch=1, epochs=1)
+ print(f"Fine-tuning completed. Training history: {history.history}")
+
+ # Detached preprocessing.
+ original_preprocessor = audio_to_text.preprocessor
+ audio_to_text.preprocessor = None
+ preprocessed_batch = original_preprocessor.generate_preprocess(
+     {"audio": audio_batch}
+ )
+ print(f"Preprocessed batch keys: {preprocessed_batch.keys()}")
+ stop_ids = (original_preprocessor.tokenizer.end_token_id,)
+ generated_batch_tokens = audio_to_text.generate(
+     preprocessed_batch, stop_token_ids=stop_ids
+ )
+ print(f"Generated tokens keys: {generated_batch_tokens.keys()}")
+ final_strings = original_preprocessor.generate_postprocess(
+     generated_batch_tokens
+ )
+ print(f"Final generated strings (detached): {final_strings}")
+ audio_to_text.preprocessor = original_preprocessor
+ print("Preprocessor reattached.")
+ ```
+
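+ After fine-tuning, the task can be written back out as a local preset and optionally published; this is a sketch following the KerasHub Model Publishing Guide linked above, where the directory name and repository id are placeholders:
+
+ ```python
+ # Save the fine-tuned task (weights, config, and preprocessor) to a preset directory.
+ audio_to_text.save_to_preset("./moonshine_tiny_en_finetuned")
+
+ # Optionally upload the preset, e.g. to the Hugging Face Hub (requires authentication).
+ keras_hub.upload_preset(
+     "hf://your-username/moonshine_tiny_en_finetuned",
+     "./moonshine_tiny_en_finetuned",
+ )
+ ```
+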
+ ## Example Usage with Hugging Face URI
+
+ This example is identical to the one above, except that the preset is loaded from the Hugging Face Hub via the `hf://` URI rather than by the built-in preset name.
+
+ ```python
+ import os
+
+ import keras
+ import keras_hub
+ import numpy as np
+ import librosa
+ import tensorflow as tf
+
+ from keras_hub.src.models.moonshine.moonshine_audio_to_text import (
+     MoonshineAudioToText,
+ )
+
+ # Custom backbone.
+ backbone = keras_hub.models.MoonshineBackbone(
+     vocabulary_size=10000,
+     filter_dim=256,
+     encoder_num_layers=6,
+     decoder_num_layers=6,
+     hidden_dim=256,
+     intermediate_dim=512,
+     encoder_num_heads=8,
+     decoder_num_heads=8,
+     feedforward_expansion_factor=4,
+     decoder_use_swiglu_activation=True,
+     encoder_use_swiglu_activation=False,
+ )
+ # Audio features as input (e.g., from MoonshineAudioConverter).
+ outputs = backbone(
+     {
+         "encoder_input_values": np.zeros((1, 16000, 1)),
+         "encoder_padding_mask": np.ones((1, 16000), dtype=bool),
+         "decoder_token_ids": np.zeros((1, 20), dtype=np.int32),
+         "decoder_padding_mask": np.ones((1, 20), dtype=bool),
+     }
+ )
+
+ # Config for test.
+ BATCH_SIZE = 2
+ AUDIO_PATH = "path/to/audio_file.wav"
+
+ # Load and prepare audio data.
+ audio, sr = librosa.load(AUDIO_PATH, sr=16000, mono=True)
+ audio_tensor = tf.expand_dims(audio, axis=-1)
+ audio_tensor = tf.convert_to_tensor(audio_tensor, dtype=tf.float32)
+ single_audio_input_batched = tf.expand_dims(audio_tensor, axis=0)
+ audio_batch = tf.repeat(single_audio_input_batched, BATCH_SIZE, axis=0)
+ dummy_texts = ["Sample transcription.", "Another sample transcription."]
+
+ # Create tf.data.Dataset.
+ audio_ds = tf.data.Dataset.from_tensor_slices(audio_batch)
+ text_ds = tf.data.Dataset.from_tensor_slices(dummy_texts)
+ audio_dataset = (
+     tf.data.Dataset.zip((audio_ds, text_ds))
+     .map(lambda audio, txt: {"audio": audio, "text": txt})
+     .batch(BATCH_SIZE)
+ )
+ print("Audio dataset created.")
+
+ # Load pretrained Moonshine model.
+ audio_to_text = MoonshineAudioToText.from_preset("hf://keras/moonshine_tiny_en")
+
+ # Generation examples.
+ generated_text_single = audio_to_text.generate(
+     {"audio": single_audio_input_batched}
+ )
+ print(f"Generated text (single audio): {generated_text_single}")
+
+ generated_text_batch = audio_to_text.generate({"audio": audio_batch})
+ print(f"Generated text (batch audio): {generated_text_batch}")
+
+ # Compile the generate() function with a custom sampler.
+ audio_to_text.compile(sampler="top_k")
+ generated_text_top_k = audio_to_text.generate(
+     {"audio": single_audio_input_batched}
+ )
+ print(f"Generated text (top_k sampler): {generated_text_top_k}")
+
+ audio_to_text.compile(sampler="greedy")
+ generated_text_greedy = audio_to_text.generate(
+     {"audio": single_audio_input_batched}
+ )
+ print(f"Generated text (greedy sampler): {generated_text_greedy}")
+
+ # Fine-tuning example.
+ audio_to_text.compile(
+     optimizer=keras.optimizers.Adam(learning_rate=1e-5),
+     loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+     weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
+ )
+ history = audio_to_text.fit(audio_dataset, steps_per_epoch=1, epochs=1)
+ print(f"Fine-tuning completed. Training history: {history.history}")
+
+ # Detached preprocessing.
+ original_preprocessor = audio_to_text.preprocessor
+ audio_to_text.preprocessor = None
+ preprocessed_batch = original_preprocessor.generate_preprocess(
+     {"audio": audio_batch}
+ )
+ print(f"Preprocessed batch keys: {preprocessed_batch.keys()}")
+ stop_ids = (original_preprocessor.tokenizer.end_token_id,)
+ generated_batch_tokens = audio_to_text.generate(
+     preprocessed_batch, stop_token_ids=stop_ids
+ )
+ print(f"Generated tokens keys: {generated_batch_tokens.keys()}")
+ final_strings = original_preprocessor.generate_postprocess(
+     generated_batch_tokens
+ )
+ print(f"Final generated strings (detached): {final_strings}")
+ audio_to_text.preprocessor = original_preprocessor
+ print("Preprocessor reattached.")
+ ```