Update README.md
README.md CHANGED
@@ -129,7 +129,7 @@ by Alec Radford et al. from OpenAI. The original code repository can be found [h
The `Whisper-large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using `whisper-large-v2`.
The model was trained for 2.0 epochs over this mixture dataset.

- The `Whisper-large-v3 model shows improved performance over a wide variety of languages, performs lower than 60% error rate on Common Voice 15 and Fleurs, shows 10% to 20% reduction of errors compared to `Whisper-large-v2`.
+ The `Whisper-large-v3` model shows improved performance over a wide variety of languages: it achieves an error rate below 60% on Common Voice 15 and Fleurs, and a 10% to 20% reduction in errors compared to `Whisper-large-v2`.

**Disclaimer**: Content for this model card has partly been written by the Hugging Face team, and parts of it were
@@ -161,193 +161,201 @@ checkpoints are summarised in the following table with links to the models on th
| large-v2 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 1550 M | x | [✓](https://huggingface.co/openai/whisper-large-v3) |
- Which tells the model to decode in English, under the task of speech recognition, and not to predict timestamps.
-
- Which forces the model to predict in English under the task of speech recognition.
-
- In this example, the context tokens are 'unforced', meaning the model automatically predicts the output language
- (English) and task (transcribe).
-
- The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.
-
- The following example demonstrates French to French transcription by setting the decoder ids appropriately.
-
- ```python
- >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
- >>> from datasets import Audio, load_dataset
-
- >>> # load model and processor
- >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
- >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
- >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
-
- >>> # load streaming dataset and read first audio sample
- >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
- >>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
- >>> input_speech = next(iter(ds))["audio"]
- >>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
-
- >>> # generate token ids
- >>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
- >>> # decode token ids to text
- >>> transcription = processor.batch_decode(predicted_ids)
- ['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']
-
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
- [' Un vrai travail intéressant va enfin être mené sur ce sujet.']
- ```
-
- Setting the task to "translate" forces the Whisper model to perform speech translation.
-
- ```python
- >>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
- >>> from datasets import Audio, load_dataset
-
- >>> # load model and processor
- >>> processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
- >>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
- >>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
-
- >>> # load streaming dataset and read first audio sample
- >>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
- >>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
- >>> input_speech = next(iter(ds))["audio"]
- >>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features
-
- >>> # generate token ids
- >>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
- >>> # decode token ids to text
- >>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
- [' A very interesting work, we will finally be given on this subject.']
- ```
-
- ## Long-Form Transcription
-
- The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
- algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
- [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
- method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
- can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:
-
- ```python
- >>> import torch
- >>> from transformers import pipeline
- >>> from datasets import load_dataset
-
- >>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
-
- >>> pipe = pipeline(
- >>>    "automatic-speech-recognition",
- >>>    model="openai/whisper-large-v2",
- >>>    chunk_length_s=30,
- >>>    device=device,
- >>> )
-
- " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
- ```
+ ## Usage
+
+ Whisper-large-v3 is supported in Hugging Face 🤗 Transformers through the `main` branch of the Transformers repo. To run the model, first
+ install the Transformers library from the GitHub repo. For this example, we'll also install 🤗 Datasets to load a toy
+ audio dataset from the Hugging Face Hub:
+
+ ```bash
+ pip install --upgrade pip
+ pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
+ ```
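As an optional sanity check (a minimal sketch; the exact version string depends on the release cycle), you can confirm that the source install from GitHub is the one being used, since development builds of Transformers typically report a version ending in `.dev0`:

```python
# Optional check: a Transformers build installed from the GitHub main branch
# usually carries a ".dev0" suffix in its version string.
import transformers

print(transformers.__version__)
```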
+
+ ### Short-Form Transcription
+
+ The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
+ class to transcribe short-form audio files (< 30 seconds) as follows:
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
+ ```
+
+ To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
+ ```diff
+ - result = pipe(sample)
+ + result = pipe("audio.mp3")
+ ```
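The pipeline also forwards generation options to the model, which is useful when the source language is known in advance or when you want speech translation rather than transcription. A minimal sketch, assuming a Transformers version whose Whisper generation accepts the `language` and `task` arguments, and a hypothetical French recording `audio.mp3`:

```python
# Sketch: reuses the `pipe` object from the snippet above.
# "language" pins the source language instead of relying on automatic detection,
# and task="translate" returns an English translation instead of a transcription.
result = pipe("audio.mp3", generate_kwargs={"language": "french", "task": "translate"})
print(result["text"])
```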
+
+ ### Long-Form Transcription
+
+ Through Transformers, Whisper-large-v3 uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm
+ is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
+
+ To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:
+
+ ```python
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     chunk_length_s=15,
+     batch_size=16,
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
+ ```
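The chunked pipeline can also return timestamps alongside the text, as the earlier long-form example did via `return_timestamps=True`. A short sketch that reuses the `pipe` and `sample` objects from the snippet above (the printed format is illustrative only):

```python
# return_timestamps=True asks the pipeline for segment-level timestamps,
# returned as a list of chunks with (start, end) tuples.
result = pipe(sample, return_timestamps=True)

for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} -> {end}] {chunk['text']}")
```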
+
+ <!---
+ **Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:
+
+ ```python
+ result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
+ ```
+ --->
+
+ ### Speculative Decoding
+
+ [Distil-Whisper](https://hf.co/distil-whisper/large-v2) can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
+ ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
+ replacement for existing Whisper pipelines, since the same outputs are guaranteed.
+
+ In the following code snippet, we load the assistant Distil-Whisper model standalone, separate from the main Whisper pipeline. We then
+ specify it as the "assistant model" for generation:
+
+ ```python
+ from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
+ import torch
+ from datasets import load_dataset
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ assistant_model_id = "distil-whisper/distil-large-v2"
+
+ assistant_model = AutoModelForCausalLM.from_pretrained(
+     assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ assistant_model.to(device)
+
+ model_id = "openai/whisper-large-v3"
+
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     generate_kwargs={"assistant_model": assistant_model},
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ sample = dataset[0]["audio"]
+
+ result = pipe(sample)
+ print(result["text"])
+ ```
+
+ ## Additional Speed & Memory Improvements
+
+ You can apply additional speed and memory improvements to Whisper-large-v3 which we cover below.
+
+ ### Flash Attention
+
+ We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2) if your GPU allows for it.
+ To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
+
+ ```bash
+ pip install flash-attn --no-build-isolation
+ ```
+
+ and then pass `use_flash_attention_2=True` to `from_pretrained`:
+
+ ```diff
+ - model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)
+ ```
+
+ ### Torch Scaled Dot-Product Attention (SDPA)
+
+ If your GPU does not support Flash Attention, we recommend making use of [BetterTransformer](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#bettertransformer).
+ To do so, you first need to install optimum:
+
+ ```bash
+ pip install --upgrade optimum
+ ```
+
+ And then convert your model to a "BetterTransformer" model before using it:
+
+ ```diff
+   model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ + model = model.to_bettertransformer()
+ ```
+
## Fine-Tuning