joaoalvarenga committed on
Commit
778feb4
1 Parent(s): 08141d9

Create README.md

  1. README.md +134 -0
README.md ADDED
@@ -0,0 +1,134 @@
+ ---
+ language: pt
+ datasets:
+ - common_voice
+ metrics:
+ - wer
+ tags:
+ - audio
+ - speech
+ - wav2vec2
+ - pt
+ - apache-2.0
+ - portuguese-speech-corpus
+ - automatic-speech-recognition
+ - speech
+ - PyTorch
+ license: apache-2.0
+ model-index:
+ - name: JoaoAlvarenga Wav2Vec2 Large 100k VoxPopuli Portuguese
+   results:
+   - task:
+       name: Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice pt
+       type: common_voice
+       args: pt
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 19.735723%
+ ---
+
+
+ # Wav2Vec2-Large-100k-VoxPopuli-Portuguese
+
+ Fine-tuned [facebook/wav2vec2-large-100k-voxpopuli](https://huggingface.co/facebook/wav2vec2-large-100k-voxpopuli) on Portuguese using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
+
+ ## Usage
+
+ The model can be used directly (without a language model) as follows:
+
+ ```python
+ import torch
+ import torchaudio
+ from datasets import load_dataset
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ test_dataset = load_dataset("common_voice", "pt", split="test[:2%]")
+
+ processor = Wav2Vec2Processor.from_pretrained("joorock12/wav2vec2-large-100k-voxpopuli-pt")
+ model = Wav2Vec2ForCTC.from_pretrained("joorock12/wav2vec2-large-100k-voxpopuli-pt")
+
+ # Resample from 48 kHz (the assumed source rate) to the 16 kHz passed to the processor.
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays.
+ def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+ inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ # Greedy decoding: pick the most likely token for every frame.
+ predicted_ids = torch.argmax(logits, dim=-1)
+
+ print("Prediction:", processor.batch_decode(predicted_ids))
+ print("Reference:", test_dataset["sentence"][:2])
+ ```
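The last two steps of the usage script (frame-wise `argmax`, then `batch_decode`) amount to greedy CTC decoding. A minimal sketch of that idea follows, assuming a made-up four-character vocabulary and a blank id of 0; the real vocabulary and blank token come from the processor, and `ctc_greedy_decode` is a hypothetical helper, not part of `transformers`.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the processor.
BLANK_ID = 0
ID_TO_CHAR = {1: "o", 2: "l", 3: "á", 4: " "}


def ctc_greedy_decode(frame_ids, blank_id=BLANK_ID, id_to_char=ID_TO_CHAR):
    """Collapse consecutive repeated ids, drop blanks, then map ids to characters."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Frame-wise argmax ids for a short, invented utterance.
frames = [1, 1, 0, 2, 2, 0, 3, 3, 3, 0]
print(ctc_greedy_decode(frames))
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.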
+
+
+ ## Evaluation
+
+ The model can be evaluated as follows on the Portuguese test data of Common Voice.
+
+ The evaluation script requires Enelvo, an open-source spell-correction tool trained on Twitter user posts:
+ `pip install enelvo`
+
+ ```python
+ import re
+
+ import torch
+ import torchaudio
+ from datasets import load_dataset, load_metric
+ from enelvo import normaliser
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ test_dataset = load_dataset("common_voice", "pt", split="test")
+ wer = load_metric("wer")
+
+ processor = Wav2Vec2Processor.from_pretrained("joorock12/wav2vec2-large-100k-voxpopuli-pt")
+ model = Wav2Vec2ForCTC.from_pretrained("joorock12/wav2vec2-large-100k-voxpopuli-pt")
+ model.to("cuda")
+
+ # Raw string so the backslashes reach the regex engine unchanged.
+ chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+ norm = normaliser.Normaliser()
+
+ # Preprocessing the datasets.
+ # We need to clean the reference text and read the audio files as arrays.
+ def speech_file_to_array_fn(batch):
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+ # Run batched inference and normalise the decoded predictions with Enelvo.
+ def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = [norm.normalise(i) for i in processor.batch_decode(pred_ids)]
+     return batch
+
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
+
+ print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+ ```
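The reference-cleaning step in the script above can be tried in isolation. The sketch below reuses the same character class as a raw string; `clean_reference` is a hypothetical helper name and the sample sentences are invented for illustration.

```python
import re

# Same character class as in the evaluation script, written as a raw string.
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'


def clean_reference(sentence: str) -> str:
    """Strip the ignored punctuation and lowercase the transcript."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(clean_reference("Olá, tudo bem?"))
```

Because the hyphen is in the ignored set and is removed rather than replaced by a space, hyphenated words are joined (for example, `guarda-chuva` becomes `guardachuva`), which affects word counts in the WER.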
+
+ **Test Result (wer)**: 19.735723%
+
+
+ ## Training
+
+ The Common Voice `train` and `validation` splits were used for training.
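For readers unfamiliar with the metric, word error rate can be sketched as the word-level edit distance summed over all utterances, divided by the total number of reference words. The code below is an illustrative pure-Python version under that assumption, not the implementation behind `load_metric("wer")`; `edit_distance` and `corpus_wer` are hypothetical names and the sentences are invented.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), p.split()) for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


references = ["o gato dorme", "bom dia"]
predictions = ["o gato come", "bom dia"]
print("WER: {:.2f}%".format(100 * corpus_wer(references, predictions)))
```

Here one substituted word out of five reference words gives a rate of one fifth; the figure reported above is the same kind of ratio expressed as a percentage.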