---
language: ar
datasets:
- common_voice
- arabic_speech_corpus
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Mohammed XLSR Wav2Vec2 Large 53
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ar
      type: common_voice
      args: ar
    metrics:
    - name: Test WER
      type: wer
      value: 26.55
    - name: Validation WER
      type: wer
      value: 36.53
---
# Wav2Vec2-Large-XLSR-53-Arabic

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Arabic using the `train` splits of [Common Voice](https://huggingface.co/datasets/common_voice) and [Arabic Speech Corpus](https://huggingface.co/datasets/arabic_speech_corpus). When using this model, make sure that your speech input is sampled at 16 kHz.
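
If your audio was recorded at a different rate, resample it to 16 kHz before passing it to the model. A minimal sketch with torchaudio (`my_audio.wav` is a placeholder path):

```python
import torchaudio

# Load a clip at whatever rate it was recorded ("my_audio.wav" is a placeholder)
speech_array, sampling_rate = torchaudio.load("my_audio.wav")

# Resample to the 16 kHz the model expects, if needed
if sampling_rate != 16_000:
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)
```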

## Usage

The model can be used directly (without a language model) as follows:

```python
# Install the dependencies first, e.g.:
#   pip install datasets transformers==4.4.0 torchaudio jiwer tnkeeh
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Arabic Common Voice test split for a quick check
test_dataset = load_dataset("common_voice", "ar", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic")
model = Wav2Vec2ForCTC.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic")

# Common Voice clips are 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Read each audio file into a 16 kHz float array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("The predicted sentence is:", processor.batch_decode(predicted_ids))
print("The original sentence is:", test_dataset["sentence"][:2])
```

The output is:

```
The predicted sentence is: ['ألديك قلم', 'ليست نارك مكسافة على هذه الأرض أبعد من يوم أمس']
The original sentence is: ['ألديك قلم ؟', 'ليست هناك مسافة على هذه الأرض أبعد من يوم أمس.']
```
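
To put a number on how close these sample predictions are, their word error rate can be computed with `jiwer` (a quick sketch; the strings are copied from the output above):

```python
from jiwer import wer

# References and predictions copied from the sample output above
references = ["ألديك قلم ؟", "ليست هناك مسافة على هذه الأرض أبعد من يوم أمس."]
predictions = ["ألديك قلم", "ليست نارك مكسافة على هذه الأرض أبعد من يوم أمس"]

print("Sample WER: {:.2f}%".format(100 * wer(references, predictions)))
```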

## Evaluation

The model can be evaluated as follows on the Arabic test data of Common Voice:

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Characters to strip from the reference transcripts: Arabic diacritics,
# punctuation, and a few dataset-specific artifacts
chars_to_remove = {
    # Arabic diacritics
    'ِ': '', 'ُ': '', 'ٓ': '', 'ٰ': '', 'ْ': '', 'ٌ': '', 'ٍ': '', 'ً': '', 'ّ': '', 'َ': '',
    # punctuation and symbols
    '~': '', ',': '', 'ـ': '', '—': '', '.': '', '!': '', '-': '', ';': '', ':': '',
    '\'': '', '"': '', '☭': '', '«': '', '»': '', '؛': '', '_': '', '،': '',
    '“': '', '%': '', '‘': '', '”': '', '�': '', '?': '', '#': '', '؟': '',
    ' ': ' ',  # keep plain spaces as they are
    '\'ۖ ': '', '\'ۚ': '', ' \'': '',
    # artifacts found in this dataset's transcripts
    'get': '', '31': '', '24': '', '39': '',
}

# Replace every mapped character in one pass with a regex built from the keys
def remove_special_characters(batch):
    regex = re.compile("(%s)" % "|".join(map(re.escape, chars_to_remove.keys())))
    batch["sentence"] = regex.sub(lambda mo: chars_to_remove[mo.group()], batch["sentence"])
    return batch

test_dataset = load_dataset("common_voice", "ar", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic")
model = Wav2Vec2ForCTC.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic")
model.to("cuda")

# Common Voice clips are 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Read each audio file into a 16 kHz float array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
test_dataset = test_dataset.map(remove_special_characters)

# Run batched inference on the GPU and decode the predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
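
As a quick sanity check, the cleaning step can be run on a single example (reusing `remove_special_characters` from the script above):

```python
sample = {"sentence": "ليست هناك مسافة على هذه الأرض أبعد من يوم أمس."}
print(remove_special_characters(sample)["sentence"])
# prints the sentence with the trailing period removed
```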

**Test Result**: 36.53% WER
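
For reference, WER is the minimum number of word edits needed to turn a prediction into its reference, normalized by the reference length:

```
WER = (substitutions + deletions + insertions) / words in the reference
```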

## Future Work

Accuracy could be improved further with *data augmentation*, *transliteration*, or better use of the `attention_mask` during fine-tuning; one possible augmentation is sketched below.
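
For example, speed perturbation is a common audio augmentation. A minimal sketch with torchaudio (not the recipe used for this model; `clip.wav` and the 0.9/1.1 factors are illustrative):

```python
import torchaudio

def speed_perturb(speech_array, sampling_rate, factor):
    # Change playback speed, then resample back to the original rate
    effects = [["speed", str(factor)], ["rate", str(sampling_rate)]]
    perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
        speech_array, sampling_rate, effects
    )
    return perturbed

# Create slowed-down and sped-up variants of one training clip
speech_array, sampling_rate = torchaudio.load("clip.wav")
slow = speed_perturb(speech_array, sampling_rate, 0.9)
fast = speed_perturb(speech_array, sampling_rate, 1.1)
```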