Commit f0b5724: Update README.md
Parent: 50c0799

README.md CHANGED
@@ -18,4 +18,57 @@ metrics:

## wav2vec2-2-bart-large-tedlium

This model is a sequence-to-sequence (seq2seq) model trained on the [TEDLIUM](https://huggingface.co/datasets/LIUM/tedlium) corpus (release 3).

It combines a speech encoder with a text decoder to perform automatic speech recognition. The encoder weights are initialised with the [Wav2Vec2 LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint from [@facebook](https://huggingface.co/facebook). The decoder weights are initialised with the [Bart large](https://huggingface.co/facebook/bart-large) checkpoint from [@facebook](https://huggingface.co/facebook).
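
In the `transformers` library, this kind of warm-started encoder-decoder can be assembled with `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained`. The snippet below is a minimal sketch of that construction only, not the exact training setup used for this checkpoint:

```python
from transformers import SpeechEncoderDecoderModel

# warm-start a seq2seq ASR model from a pretrained speech encoder and text decoder
# (illustrative sketch; not the exact configuration used to train this checkpoint)
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-large-lv60",  # speech encoder
    "facebook/bart-large",           # text decoder
)

# generation requires these ids to be set on the top-level config
# (values here assume BART's defaults)
model.config.decoder_start_token_id = model.decoder.config.bos_token_id
model.config.pad_token_id = model.decoder.config.pad_token_id
```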

When using the model, make sure that your speech input is sampled at 16 kHz.
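
If your audio is stored at a different sampling rate, it can be resampled on the fly with the `datasets` library. A minimal sketch, reusing the dummy dataset from the usage example below purely for illustration:

```python
from datasets import load_dataset, Audio

ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

# cast the audio column so that every example is decoded and resampled to 16 kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

print(ds[0]["audio"]["sampling_rate"])  # 16000
```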

The model achieves a word error rate (WER) of 9.0% on the dev set and 6.4% on the test set. [Training logs](https://wandb.ai/sanchit-gandhi/tedlium/runs/1w6frnel?workspace=user-sanchit-gandhi) document the training and evaluation progress over 50k steps of fine-tuning.
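
WER counts the word-level substitutions, insertions and deletions needed to turn a predicted transcript into the reference, divided by the number of reference words. A toy illustration with the `jiwer` package (the example strings are made up):

```python
from jiwer import wer

reference = "this model transcribes ted talks"
hypothesis = "this model transcribed ted talks"

# one substitution out of five reference words -> WER = 0.2
print(wer(reference, hypothesis))
```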

## Usage

To transcribe audio files the model can be used as a standalone speech-recognition model as follows:

```python
from transformers import AutoProcessor, SpeechEncoderDecoderModel
from datasets import load_dataset

# load model and processor
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

# load dummy dataset
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")

# process audio inputs (sampled at 16 kHz)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# run inference (greedy search)
generated = model.generate(input_values)

# decode
decoded = processor.batch_decode(generated, skip_special_tokens=True)
print("Target: ", ds["text"][0])
print("Transcription: ", decoded[0])
```
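
Alternatively, the high-level `pipeline` API wraps feature extraction, generation and decoding in a single call. A brief sketch ("sample.wav" is a placeholder for a local audio file):

```python
from transformers import pipeline

# the ASR pipeline loads the checkpoint's processor and model automatically
asr = pipeline("automatic-speech-recognition", model="sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

print(asr("sample.wav")["text"])
```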

## Evaluation

This code snippet shows how to evaluate **wav2vec2-2-bart-large-tedlium** on the TEDLIUM test data.

```python
from transformers import AutoProcessor, SpeechEncoderDecoderModel
from datasets import load_dataset
from jiwer import wer
import torch

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")

model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium").to("cuda")
processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")

def map_to_pred(batch):
    # pre-process the raw audio (TEDLIUM is already sampled at 16 kHz)
    audio = [sample["array"] for sample in batch["audio"]]
    input_values = processor(audio, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

    # generate transcriptions with greedy search and decode them to text
    with torch.no_grad():
        generated = model.generate(input_values.to("cuda"))
    batch["transcription"] = processor.batch_decode(generated, skip_special_tokens=True)
    return batch

result = tedlium_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```