sanchit-gandhi committed on
Commit
f0b5724
·
1 Parent(s): 50c0799

Update README.md

  1. README.md +54 -1
README.md CHANGED
@@ -18,4 +18,57 @@ metrics:
  ## wav2vec2-2-bart-large-tedlium
  This model is a sequence-2-sequence (seq2seq) model trained on the [TEDLIUM](https://huggingface.co/datasets/LIUM/tedlium) corpus (release 3).
 
- It combines a speech encoder with a text decoder to perform automatic speech recognition. The encoder weights are initialised with the [Wav2Vec2 LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint from [@facebook](https://huggingface.co/facebook). The decoder weights are initialised with the [Bart large](https://huggingface.co/facebook/bart-large) checkpoint from [@facebook](https://huggingface.co/facebook).
+ It combines a speech encoder with a text decoder to perform automatic speech recognition. The encoder weights are initialised with the [Wav2Vec2 LV-60k](https://huggingface.co/facebook/wav2vec2-large-lv60) checkpoint from [@facebook](https://huggingface.co/facebook). The decoder weights are initialised with the [Bart large](https://huggingface.co/facebook/bart-large) checkpoint from [@facebook](https://huggingface.co/facebook).
+
+ When using the model, make sure that your speech input is sampled at 16 kHz.
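The 16 kHz requirement can be met by resampling audio recorded at another rate before it reaches the processor. The sketch below is a minimal illustration using naive linear interpolation with NumPy. The helper name `resample_linear` is hypothetical and not part of this model card, and linear interpolation applies no anti-aliasing filter, so a band-limited resampler is preferable for real evaluation. With 🤗 Datasets, casting the audio column with `ds.cast_column("audio", Audio(sampling_rate=16_000))` is a common alternative.

```python
import numpy as np


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Naively resample a mono waveform by linear interpolation (hypothetical helper)."""
    if orig_sr == target_sr:
        return audio
    # number of samples the same duration occupies at the target rate
    n_target = int(round(len(audio) / orig_sr * target_sr))
    # time stamps (in seconds) of the original and the target sample grids
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio).astype(audio.dtype)


# one second of a 440 Hz tone recorded at 44.1 kHz
sr = 44_100
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
resampled = resample_linear(tone, orig_sr=sr)
print(resampled.shape)  # (16000,)
```

The resampled array can then be passed to the processor together with `sampling_rate=16_000`.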
+
+ The model achieves a word error rate (WER) of 9.0% on the dev set and 6.4% on the test set. [Training logs](https://wandb.ai/sanchit-gandhi/tedlium/runs/1w6frnel?workspace=user-sanchit-gandhi) document the training and evaluation progress over 50k steps of fine-tuning.
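For reference, WER is the word-level edit distance between the reference and the hypothesis (substitutions, deletions and insertions) divided by the number of reference words. The following sketch is a pure-Python illustration of that definition. The function name `word_error_rate` and the example sentences are made up for illustration, and the figures reported above were not produced with this code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# one substitution ("sat" -> "sit") and one deletion ("the") over six reference words
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{score:.3f}")  # 0.333
```

Libraries such as `jiwer` compute the same quantity over a whole corpus by pooling errors and reference words across all utterances.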
+
+ ## Usage
+ To transcribe audio files, the model can be used as a standalone speech recognition model as follows:
+ ```python
+ from transformers import AutoProcessor, SpeechEncoderDecoderModel
+ from datasets import load_dataset
+ import torch
+
+ # load model and processor
+ processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
+ model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
+
+ # load dummy dataset
+ ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")
+
+ # process audio inputs (the model expects 16 kHz audio)
+ sample = ds[0]["audio"]
+ input_values = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest").input_values  # batch size 1
+
+ # run inference (greedy search)
+ generated = model.generate(input_values)
+
+ # decode
+ decoded = processor.batch_decode(generated, skip_special_tokens=True)
+ print("Target: ", ds[0]["text"])
+ print("Transcription: ", decoded[0])
+ ```
+
+ ## Evaluation
+
+ This code snippet shows how to evaluate **wav2vec2-2-bart-large-tedlium** on the TEDLIUM test data.
+
+ ```python
+ from datasets import load_dataset
+ from transformers import AutoProcessor, SpeechEncoderDecoderModel
+ import torch
+ from jiwer import wer
+
+ tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")
+ model = SpeechEncoderDecoderModel.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium").to("cuda")
+ processor = AutoProcessor.from_pretrained("sanchit-gandhi/wav2vec2-2-bart-large-tedlium")
+
+ def map_to_pred(batch):
+     # with batched=True, batch["audio"] is a list of decoded audio dicts
+     arrays = [audio["array"] for audio in batch["audio"]]
+     inputs = processor(arrays, sampling_rate=16_000, return_tensors="pt", padding="longest")
+     with torch.no_grad():
+         generated = model.generate(inputs.input_values.to("cuda"))
+     batch["transcription"] = processor.batch_decode(generated, skip_special_tokens=True)
+     return batch
+
+ # drop the raw audio column once transcriptions have been computed
+ result = tedlium_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
+ print("WER:", wer(result["text"], result["transcription"]))
+ ```