reach-vb (HF staff) committed on
Commit
5c31a89
1 Parent(s): 654f042

Update README.md

Files changed (1)
  1. README.md +160 -152
README.md CHANGED
@@ -129,7 +129,7 @@ by Alec Radford et al. from OpenAI. The original code repository can be found [h
  The `Whisper-large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using `whisper-large-v2`.
  The model was trained for 2.0 epochs over this mixture dataset.

- The `Whisper-large-v3 model shows improved performance over a wide variety of languages, performs lower than 60% error rate on Common Voice 15 and Fleurs, shows 10% to 20% reduction of errors compared to `Whisper-large-v2`.


  **Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
@@ -161,193 +161,201 @@ checkpoints are summarised in the following table with links to the models on th
  | large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
  | large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |

- # Usage

- To transcribe audio samples, the model has to be used alongside a [`WhisperProcessor`](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperProcessor).

- The `WhisperProcessor` is used to:
- 1. Pre-process the audio inputs (converting them to log-Mel spectrograms for the model)
- 2. Post-process the model outputs (converting them from tokens to text)

- The model is informed of which task to perform (transcription or translation) by passing the appropriate "context tokens". These context tokens
- are a sequence of tokens that are given to the decoder at the start of the decoding process, and take the following order:
- 1. The transcription always starts with the `<|startoftranscript|>` token
- 2. The second token is the language token (e.g. `<|en|>` for English)
- 3. The third token is the "task token". It can take one of two values: `<|transcribe|>` for speech recognition or `<|translate|>` for speech translation
- 4. In addition, a `<|notimestamps|>` token is added if the model should not include timestamp prediction

- Thus, a typical sequence of context tokens might look as follows:
  ```
- <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
  ```
- Which tells the model to decode in English, under the task of speech recognition, and not to predict timestamps.

- These tokens can either be forced or un-forced. If they are forced, the model is made to predict each token at
- each position. This allows one to control the output language and task for the Whisper model. If they are un-forced,
- the Whisper model will automatically predict the output langauge and task itself.

- The context tokens can be set accordingly:

  ```python
- model.config.forced_decoder_ids = WhisperProcessor.get_decoder_prompt_ids(language="english", task="transcribe")
- ```
- Which forces the model to predict in English under the task of speech recognition.

- ## Transcription

- ### English to English
- In this example, the context tokens are 'unforced', meaning the model automatically predicts the output language
- (English) and task (transcribe).

- ```python
- >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
- >>> from datasets import load_dataset
-
- >>> # load model and processor
- >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
- >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
- >>> model.config.forced_decoder_ids = None
-
- >>> # load dummy dataset and read audio files
- >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
- >>> sample = ds[0]["audio"]
- >>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
-
- >>> # generate token ids
- >>> predicted_ids = model.generate(input_features)
- >>> # decode token ids to text
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
- ['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']
-
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
- [' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
  ```
- The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.

- ### French to French
- The following example demonstrates French to French transcription by setting the decoder ids appropriately.

  ```python
- >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
- >>> from datasets import Audio, load_dataset
-
- >>> # load model and processor
- >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
- >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
- >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
-
- >>> # load streaming dataset and read first audio sample
- >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
- >>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
- >>> input_speech = next(iter(ds))["audio"]
- >>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
-
- >>> # generate token ids
- >>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
- >>> # decode token ids to text
- >>> transcription = processor.batch_decode(predicted_ids)
- ['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']
-
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
- [' Un vrai travail intéressant va enfin être mené sur ce sujet.']
  ```

- ## Translation
- Setting the task to "translate" forces the Whisper model to perform speech translation.

- ### French to English

  ```python
- >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
- >>> from datasets import Audio, load_dataset
-
- >>> # load model and processor
- >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
- >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
- >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
-
- >>> # load streaming dataset and read first audio sample
- >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
- >>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
- >>> input_speech = next(iter(ds))["audio"]
- >>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
-
- >>> # generate token ids
- >>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
- >>> # decode token ids to text
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
- [' A very interesting work, we will finally be given on this subject.']
- ```

- ## Evaluation

- This code snippet shows how to evaluate Whisper Large on [LibriSpeech test-clean](https://huggingface.co/datasets/librispeech_asr):
-
- ```python
- >>> from datasets import load_dataset
- >>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
- >>> import torch
- >>> from evaluate import load
-
- >>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")
-
- >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
- >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2").to("cuda")
-
- >>> def map_to_pred(batch):
- >>>     audio = batch["audio"]
- >>>     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
- >>>     batch["reference"] = processor.tokenizer._normalize(batch['text'])
- >>>
- >>>     with torch.no_grad():
- >>>         predicted_ids = model.generate(input_features.to("cuda"))[0]
- >>>     transcription = processor.decode(predicted_ids)
- >>>     batch["prediction"] = processor.tokenizer._normalize(transcription)
- >>>     return batch
-
- >>> result = librispeech_test_clean.map(map_to_pred)
-
- >>> wer = load("wer")
- >>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
- 3.0003583080317572
  ```

- ## Long-Form Transcription

- The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
- algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
- [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
- method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
- can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:

- ```python
- >>> import torch
- >>> from transformers import pipeline
- >>> from datasets import load_dataset

- >>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

- >>> pipe = pipeline(
- >>>     "automatic-speech-recognition",
- >>>     model="openai/whisper-large-v2",
- >>>     chunk_length_s=30,
- >>>     device=device,
- >>> )

- >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
- >>> sample = ds[0]["audio"]

- >>> prediction = pipe(sample.copy(), batch_size=8)["text"]
- " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

- >>> # we can also return timestamps for the predictions
- >>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
- [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
-   'timestamp': (0.0, 5.44)}]
  ```

- Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.

  ## Fine-Tuning

  The `Whisper-large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using `whisper-large-v2`.
  The model was trained for 2.0 epochs over this mixture dataset.

+ The `Whisper-large-v3` model shows improved performance over a wide variety of languages, with an error rate below 60% on Common Voice 15 and Fleurs and a 10% to 20% reduction in errors compared to `Whisper-large-v2`.


  **Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were

  | large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
  | large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |

+ ## Usage

+ Whisper-large-v3 is supported in Hugging Face 🤗 Transformers through the `main` branch of the Transformers repo. To run the model, first
+ install the Transformers library from the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy
+ audio dataset from the Hugging Face Hub:

+ ```bash
+ pip install --upgrade pip
+ pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
+ ```
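
Since Whisper-large-v3 support lives on the `main` branch at the time of writing, you can optionally confirm that the development build was picked up. A minimal sketch, not required for the steps below:

```python
# A `main`-branch (GitHub) install of Transformers typically reports a ".dev0" version suffix.
import transformers

print(transformers.__version__)
```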
+
+ ### Short-Form Transcription
+
+ The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+ class to transcribe short-form audio files (< 30 seconds) as follows:
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
  ```
+
+ To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
+ ```diff
+ - result = pipe(sample)
+ + result = pipe("audio.mp3")
  ```
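
If the audio is already loaded in memory, the pipeline also accepts raw samples plus their sampling rate instead of a path. A minimal sketch, assuming `librosa` is installed and reusing the `pipe` object from above:

```python
import librosa

# Load the file ourselves and hand the raw array to the pipeline.
audio, sampling_rate = librosa.load("audio.mp3", sr=16_000)
result = pipe({"raw": audio, "sampling_rate": sampling_rate})
print(result["text"])
```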
 
+ ### Long-Form Transcription
+
+ Through Transformers, Whisper-large-v3 uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm
+ is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
+
+ To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:
+
  ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     chunk_length_s=15,
+     batch_size=16,
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
  ```
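
The chunked pipeline can also return segment-level timestamps, as in the long-form example this section replaces. A minimal sketch reusing the `pipe` and `sample` objects defined above:

```python
# Ask the pipeline for chunk-level timestamps alongside the transcription.
result = pipe(sample.copy(), return_timestamps=True)
print(result["chunks"])  # list of {"text": ..., "timestamp": (start, end)} dicts
```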
 
+ <!---
+ **Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:
+
  ```python
+ result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
  ```
+ --->

+ ### Speculative Decoding
+
+ [Distil-Whisper](https://hf.co/distil-whisper/large-v2) can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
+ ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
+ replacement for existing Whisper pipelines, since the same outputs are guaranteed.
+
+ In the following code snippet, we load the assistant Distil-Whisper model standalone, then pass it to the main Whisper
+ pipeline as the "assistant model" for generation:
+
  ```python
+ from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
+ import torch
+ from datasets import load_dataset
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ assistant_model_id = "distil-whisper/distil-large-v2"
+
+ assistant_model = AutoModelForCausalLM.from_pretrained(
+     assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ assistant_model.to(device)
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     generate_kwargs={"assistant_model": assistant_model},
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
  ```
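
To see the effect of the assistant model, one option is to time the same sample with and without it. An illustrative sketch (not a rigorous benchmark) that reuses `model`, `processor`, `sample`, `torch_dtype` and `device` from the snippet above:

```python
import time

# Baseline pipeline without the assistant model, for comparison.
baseline_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

for name, asr in [("baseline", baseline_pipe), ("speculative", pipe)]:
    start = time.time()
    asr(sample.copy())
    print(f"{name}: {time.time() - start:.2f}s")
```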

+ ## Additional Speed & Memory Improvements
+
+ You can apply additional speed and memory improvements to Whisper-large-v3, which we cover in the following.
+
+ ### Flash Attention
+
+ We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it.
+ To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
+
+ ```bash
+ pip install flash-attn --no-build-isolation
+ ```
+
+ and then all you have to do is pass `use_flash_attention_2=True` to `from_pretrained`:
+
+ ```diff
+ - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
+ ```
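
If the same script needs to run on machines where flash-attn may not be installed, one option is to guard the flag with an availability check. A minimal sketch, assuming `transformers.utils.is_flash_attn_2_available` is present in your Transformers version:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq
from transformers.utils import is_flash_attn_2_available

# Only request Flash Attention 2 when flash-attn is actually usable; otherwise fall back
# to the default attention implementation.
attn_kwargs = {"use_flash_attention_2": True} if is_flash_attn_2_available() else {}

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    **attn_kwargs,
)
```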

+ ### Torch Scaled Dot-Product Attention (SDPA)
+
+ If your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer).
+ To do so, you first need to install Optimum:
+
+ ```bash
+ pip install --upgrade optimum
  ```

+ And then convert your model to a "BetterTransformer" model before using it:
+
+ ```diff
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ + model = model.to_bettertransformer()
+ ```

  ## Fine-Tuning