---
language:
- en
tags:
- automatic-speech-recognition
datasets:
- LIUM/tedlium
license: cc-by-4.0
metrics:
- name: Dev WER
type: wer
value: 9.0
- name: Test WER
type: wer
value: 6.4
---
## Wav2Vec2-2-Bart-Large-Tedlium
This model is a sequence-to-sequence (seq2seq) model trained on the [TEDLIUM](https://huggingface.co/datasets/LIUM/tedlium) corpus (release 3).
It combines a speech encoder with a text decoder to perform automatic speech recognition. The encoder weights are initialised with the [Wav2Vec2 LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint from [@facebook](https://huggingface.co/facebook). The decoder weights are initialised with the [Bart large](https://huggingface.co/facebook/bart-large) checkpoint from [@facebook](https://huggingface.co/facebook).
When using the model, make sure that your speech input is sampled at 16 kHz.
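As a minimal sketch of how the sampling-rate requirement could be checked before inference, the snippet below reads the rate from a WAV file header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model's tooling.

```python
import wave

# Sampling rate the model expects, per the note above.
TARGET_SAMPLE_RATE = 16_000


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target=TARGET_SAMPLE_RATE):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target
```

If a file does need resampling and the audio is loaded through `datasets`, one option is to cast the audio column with `ds.cast_column("audio", Audio(sampling_rate=16_000))`, which resamples each example when it is accessed.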
The model achieves a word error rate (WER) of 9.0% on the dev set and 6.4% on the test set. [Training logs](https://wandb.ai/sanchit-gandhi/tedlium/runs/1w6frnel?workspace=user-sanchit-gandhi) document the training and evaluation progress over 50k steps of fine-tuning.
## Usage
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
```python
from transformers import AutoProcessor, SpeechEncoderDecoderModel
from datasets import load_dataset
import torch
# load model and processor
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
# load dummy dataset
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")
# process audio inputs
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values # Batch size 1
# run inference (greedy search)
generated = model.generate(input_values)
# decode
decoded = processor.batch_decode(generated, skip_special_tokens=True)
print("Target: ", ds["text"][0])
print("Transcription: ", decoded[0])
```
## Evaluation
This code snippet shows how to evaluate **Wav2Vec2-2-Bart-Large-Tedlium** on the TEDLIUM test data.
```python
from datasets import load_dataset
from transformers import AutoProcessor, SpeechEncoderDecoderModel
import torch
from jiwer import wer

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")

def filter_ds(text):
    return text != "ignore_time_segment_in_scoring"

# remove samples ignored from scoring (filter drops rows; map would keep them all)
tedlium_eval = tedlium_eval.filter(filter_ds, input_columns=["text"])

model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium").to("cuda")
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

# beam search settings used for generation
gen_kwargs = {
    "max_length": 200,
    "num_beams": 5,
    "length_penalty": 1.2,
}

def map_to_pred(batch):
    # called on one example at a time, so batch["audio"] is a single dict
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        generated = model.generate(input_values.to("cuda"), **gen_kwargs)
    decoded = processor.batch_decode(generated, skip_special_tokens=True)
    batch["transcription"] = decoded[0]
    return batch

# drop the raw audio column once the transcription has been computed
result = tedlium_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
```
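To make the reported metric concrete, here is a minimal pure-Python sketch of word error rate for a single reference/hypothesis pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is hypothetical. It is an illustration only and does not reproduce `jiwer`, which additionally aggregates errors over the whole list of sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In this example the hypothesis has one substitution ("sat" to "sit") and one deletion (the second "the"), so the result is 2 errors over 6 reference words.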