Commit f0b5724: Update README.md
Parent: 50c0799

README.md CHANGED
@@ -18,4 +18,57 @@ metrics:

## wav2vec2-2-bart-large-tedlium

This model is a sequence-to-sequence (seq2seq) model trained on the [TEDLIUM](https://huggingface.co/datasets/LIUM/tedlium) corpus (release 3).

It combines a speech encoder with a text decoder to perform automatic speech recognition. The encoder weights are initialised with the [Wav2Vec2 LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint from [@facebook](https://huggingface.co/facebook). The decoder weights are initialised with the [Bart large](https://huggingface.co/facebook/bart-large) checkpoint from [@facebook](https://huggingface.co/facebook).
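
In the `transformers` library, this kind of warm-started encoder-decoder can be assembled with `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained`. The snippet below is a minimal sketch of that construction only, not the exact training setup used for this checkpoint:

```python
from transformers import SpeechEncoderDecoderModel

# warm-start a seq2seq ASR model from a pretrained speech encoder and text decoder
# (illustrative sketch; not the exact configuration used to train this checkpoint)
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-large-lv60",  # speech encoder
    "facebook/bart-large",           # text decoder
)

# generation requires these ids to be set on the top-level config
# (values here assume BART's defaults)
model.config.decoder_start_token_id = model.decoder.config.bos_token_id
model.config.pad_token_id = model.decoder.config.pad_token_id
```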

When using the model, make sure that your speech input is sampled at 16 kHz.
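
If your audio is stored at a different sampling rate, it can be resampled on the fly with the `datasets` library. A minimal sketch, reusing the dummy dataset from the usage example below purely for illustration:

```python
from datasets import load_dataset, Audio

ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

# cast the audio column so that every example is decoded and resampled to 16 kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

print(ds[0]["audio"]["sampling_rate"])  # 16000
```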

The model achieves a word error rate (WER) of 9.0% on the dev set and 6.4% on the test set. [Training logs](https://wandb.ai/sanchit-gandhi/tedlium/runs/1w6frnel?workspace=user-sanchit-gandhi) document the training and evaluation progress over 50k steps of fine-tuning.
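
WER counts the word-level substitutions, insertions and deletions needed to turn a predicted transcript into the reference, divided by the number of reference words. A toy illustration with the `jiwer` package (the example strings are made up):

```python
from jiwer import wer

reference = "this model transcribes ted talks"
hypothesis = "this model transcribed ted talks"

# one substitution out of five reference words -> WER = 0.2
print(wer(reference, hypothesis))
```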

## Usage

To transcribe audio files the model can be used as a standalone speech-recognition model as follows:

```python
from transformers import AutoProcessor, SpeechEncoderDecoderModel
from datasets import load_dataset

# load model and processor
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

# load dummy dataset
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

# process audio inputs (sampled at 16 kHz)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# run inference (greedy search)
generated = model.generate(input_values)

# decode
decoded = processor.batch_decode(generated, skip_special_tokens=True)
print("Target: ", ds["text"][0])
print("Transcription: ", decoded[0])
```
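
Alternatively, the high-level `pipeline` API wraps feature extraction, generation and decoding in a single call. A brief sketch ("sample.wav" is a placeholder for a local audio file):

```python
from transformers import pipeline

# the ASR pipeline loads the checkpoint's processor and model automatically
asr = pipeline("automatic-speech-recognition", model="sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

print(asr("sample.wav")["text"])
```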

## Evaluation

This code snippet shows how to evaluate **wav2vec2-2-bart-large-tedlium** on the TEDLIUM test data.

```python
from transformers import AutoProcessor, SpeechEncoderDecoderModel
from datasets import load_dataset
from jiwer import wer
import torch

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")

model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium").to("cuda")
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

def map_to_pred(batch):
    # pre-process the raw audio (TEDLIUM is already sampled at 16 kHz)
    audio = [sample["array"] for sample in batch["audio"]]
    input_values = processor(audio, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

    # generate transcriptions with greedy search and decode them to text
    with torch.no_grad():
        generated = model.generate(input_values.to("cuda"))
    batch["transcription"] = processor.batch_decode(generated, skip_special_tokens=True)
    return batch

result = tedlium_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```