chrisjay committed on
Commit
509522b
·
1 Parent(s): 6f2a6d7

Create README.md

Files changed (1)
  1. README.md +226 -0
README.md ADDED
@@ -0,0 +1,226 @@
+ ---
+ language: fon
+ datasets:
+ - "[Fon Dataset](https://github.com/laleye/pyFongbe/tree/master/data)"
+ metrics:
+ - wer
+ tags:
+ - audio
+ - automatic-speech-recognition
+ - speech
+ - xlsr-fine-tuning-week
+ license: apache-2.0
+ model-index:
+ - name: Fon XLSR Wav2Vec2 Large 53
+   results:
+   - task:
+       name: Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: fon
+       type: fon_dataset
+       args: fon
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 14.97
+ ---
+
+ # Wav2Vec2-Large-XLSR-53-Fon
+
+ Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on [Fon](https://en.wikipedia.org/wiki/Fon_language) using the [Fon Dataset](https://github.com/laleye/pyFongbe/tree/master/data).
+
+ When using this model, make sure that your speech input is sampled at 16 kHz.
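
One way to check the 16 kHz requirement before running inference is to read the sample rate from the WAV header. The sketch below uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The helper names `wav_sample_rate` and `is_16khz` are illustrative assumptions and are not part of this repository.

```python
import wave

EXPECTED_RATE = 16_000  # the sampling rate this model card asks for


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE
```

A call such as `is_16khz("example.wav")` (hypothetical file name) returns `False` for audio that would first need resampling, for example with `torchaudio.transforms.Resample`.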
+
+ ## Usage
+
+ The model can be used directly (without a language model) as follows:
+
+ ```python
+ import csv
+ import json
+ import os
+ import random
+ import re
+ import urllib.request
+ import zipfile
+
+ import torch
+ import torchaudio
+ from datasets import load_dataset
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ # Download the files from Laleye's GitHub repository into the directory FonAudio
+ # (in a notebook, `!wget <url>` does the same job)
+ if not os.path.isdir("./FonAudio"):
+     urllib.request.urlretrieve(
+         "https://github.com/laleye/pyFongbe/archive/master/data.zip", "data.zip"
+     )
+     with zipfile.ZipFile("data.zip", "r") as zip_ref:
+         zip_ref.extractall("./FonAudio")
+
+ # Read the train and test CSV files, skipping the first row of each
+ with open('./FonAudio/pyFongbe-master/data/train.csv', newline='', encoding='UTF-8') as f:
+     train_data = list(csv.reader(f))[1:]
+
+ with open('./FonAudio/pyFongbe-master/data/test.csv', newline='', encoding='UTF-8') as f:
+     t_data = list(csv.reader(f))[1:]
+
+
+ # Get validation indices
+ random.seed(42)  # this seed was used specifically to compare with the Okwugbe model
+
+ v = 1500  # number of validation indices to draw; change as you want
+ test_list = list(range(len(t_data)))
+ valid_indices = set(random.choices(test_list, k=v))  # drawn with replacement
+
+ test_data = [t_data[i] for i in test_list if i not in valid_indices]
+ valid_data = [t_data[i] for i in test_list if i in valid_indices]
+
+ # Length of validation_dataset -> 1107
+ # Length of test_dataset -> 1061
+
+ # Please note: the final validation size is smaller than the requested 1500 because
+ # random.choices samples with replacement, so the drawn indices can contain duplicates.
+
+ # Create JSON files
+ def create_json_file(d):
+     utterance = d[2]
+     wav_path = d[0]
+     wav_path = wav_path.replace("/home/frejus/Projects/Fongbe_ASR/pyFongbe", "./FonAudio/pyFongbe-master")
+     return {
+         "path": wav_path,
+         "sentence": utterance
+     }
+
+ train_json = [create_json_file(i) for i in train_data]
+ test_json = [create_json_file(i) for i in test_data]
+ valid_json = [create_json_file(i) for i in valid_data]
+
+ # Save JSON files to your Google Drive folders
+ # Make folders in Google Drive to store the files
+ train_path = '/content/drive/MyDrive/fon_xlsr/train'
+ test_path = '/content/drive/MyDrive/fon_xlsr/test'
+ valid_path = '/content/drive/MyDrive/fon_xlsr/valid'
+
+ if not os.path.isdir(train_path):
+     print("Creating paths")
+     os.makedirs(train_path)
+     os.makedirs(test_path)  # this is where we save the test files
+     os.makedirs(valid_path)
+
+
+ # for train
+ for i, sample in enumerate(train_json):
+     file_path = os.path.join(train_path, 'train_fon_{}.json'.format(i))
+     with open(file_path, 'w') as outfile:
+         json.dump(sample, outfile)
+
+ # for test
+ for i, sample in enumerate(test_json):
+     file_path = os.path.join(test_path, 'test_fon_{}.json'.format(i))
+     with open(file_path, 'w') as outfile:
+         json.dump(sample, outfile)
+
+ # for valid
+ for i, sample in enumerate(valid_json):
+     file_path = os.path.join(valid_path, 'valid_fon_{}.json'.format(i))
+     with open(file_path, 'w') as outfile:
+         json.dump(sample, outfile)
+
+
+ # Load test_dataset from the files saved in the test folder
+ for root, dirs, files in os.walk(test_path):
+     test_dataset = load_dataset("json", data_files=[os.path.join(root, i) for i in files], split="train")
+
+ # Remove unnecessary chars
+ chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'
+ def remove_special_characters(batch):
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
+     return batch
+
+ test_dataset = test_dataset.map(remove_special_characters)
+
+ processor = Wav2Vec2Processor.from_pretrained("chrisjay/wav2vec2-large-xlsr-53-fon")
+ model = Wav2Vec2ForCTC.from_pretrained("chrisjay/wav2vec2-large-xlsr-53-fon")
+
+ # No need for resampling because the audio dataset is already at 16 kHz
+ # resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = speech_array.squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+ inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ predicted_ids = torch.argmax(logits, dim=-1)
+
+ print("Prediction:", processor.batch_decode(predicted_ids))
+ print("Reference:", test_dataset["sentence"][:2])
+ ```
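
The split in the snippet above draws validation indices with `random.choices`, which samples with replacement. That is why asking for 1500 indices yields fewer distinct ones (1107 according to the comments). The standalone sketch below illustrates the effect on a toy index range. The population size 2168 is simply 1107 + 1061 taken from those comments, and the variable names are illustrative. It is not meant to reproduce the exact split, which depends on the real data.

```python
import random

population = list(range(2168))  # assumed size: 1107 + 1061 rows
k = 1500

random.seed(42)
with_replacement = random.choices(population, k=k)  # indices may repeat
unique_drawn = set(with_replacement)

random.seed(42)
without_replacement = random.sample(population, k)  # indices never repeat

print(len(with_replacement), len(unique_drawn))  # 1500 draws, fewer distinct indices
print(len(set(without_replacement)))             # exactly k distinct indices
```

If an exact validation size is needed, `random.sample` provides it, at the cost of no longer matching the split used for the reported results.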
+
+
+ ## Evaluation
+
+ The model can be evaluated on our Fon test split as follows.
+
+ ```python
+ import os
+ import re
+
+ import torch
+ import torchaudio
+ from datasets import load_dataset, load_metric
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ # Folder holding the test JSON files written by the Usage snippet
+ test_path = '/content/drive/MyDrive/fon_xlsr/test'
+
+ for root, dirs, files in os.walk(test_path):
+     test_dataset = load_dataset("json", data_files=[os.path.join(root, i) for i in files], split="train")
+
+ chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'
+ def remove_special_characters(batch):
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
+     return batch
+
+ test_dataset = test_dataset.map(remove_special_characters)
+ wer = load_metric("wer")
+
+ processor = Wav2Vec2Processor.from_pretrained("chrisjay/wav2vec2-large-xlsr-53-fon")
+ model = Wav2Vec2ForCTC.from_pretrained("chrisjay/wav2vec2-large-xlsr-53-fon")  # use checkpoint-12400 to get our WER test results
+ model.to("cuda")
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = speech_array[0].numpy()
+     batch["sampling_rate"] = sampling_rate
+     batch["target_text"] = batch["sentence"]
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+ # Evaluation on test dataset
+ def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
+
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
+
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+
+ ```
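
Both snippets decode by taking `torch.argmax` over the logits and passing the resulting ids to `processor.batch_decode`. Conceptually this is greedy CTC decoding: collapse consecutive repeated ids, then drop the blank token. The pure-Python sketch below illustrates the idea. The toy vocabulary, the blank id `0`, and the function name `ctc_greedy_decode` are assumptions made up for this example and do not reflect this model's actual tokenizer.

```python
def ctc_greedy_decode(ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, drop blanks, and map ids to characters."""
    chars = []
    previous = None
    for token_id in ids:
        # Emit a character only when the id changes and is not the blank
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical vocabulary and per-frame argmax ids
toy_vocab = {0: "<blank>", 1: "a", 2: "b", 3: " "}
frame_ids = [1, 1, 0, 1, 2, 2, 0, 3, 3, 2]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # aab b
```

The blank between the two runs of `1` is what lets the decoder emit a doubled letter instead of merging them into one.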
+
+ **Test Result**: 14.97 %
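
For reference, word error rate is the word-level edit distance between predictions and references, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition. The function names and the example sentences are illustrative assumptions, and this is not the implementation behind `load_metric("wer")`.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word-level edits over total reference words, as a percentage."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return 100 * errors / total_words


print(word_error_rate(["the cat sat"], ["the cat sat"]))            # 0.0
print(word_error_rate(["the cat sat on mat"], ["the cat sit on"]))  # 40.0
```

In the second call one word is substituted and one is deleted, giving 2 errors over 5 reference words.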
+
+ ## Training
+
+ The [Fon dataset](https://github.com/laleye/pyFongbe/tree/master/data) was split into `train` (8235 samples), `validation` (1107 samples), and `test` (1061 samples).
+
+ The script used for training can be found [here](https://colab.research.google.com/drive/11l6qhJCYnPTG1TQZ8f3EvKB9z12TQi4g?usp=sharing).