anas committed on
Commit
eebf9b7
·
1 Parent(s): 8db0fd6

Add model files

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,144 @@
+---
+language: ar
+datasets:
+- common_voice: Common Voice Corpus 4
+metrics:
+- wer
+tags:
+- audio
+- automatic-speech-recognition
+- speech
+- xlsr-fine-tuning-week
+license: apache-2.0
+model-index:
+- name: Hasni XLSR Wav2Vec2 Large 53
+  results:
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice ar
+      type: common_voice
+      args: ar
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 52.18
+---
+
+# Wav2Vec2-Large-XLSR-53-Arabic
+
+Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Arabic using the [Common Voice Corpus 4](https://commonvoice.mozilla.org/en/datasets) dataset.
+When using this model, make sure that your speech input is sampled at 16 kHz.
+
+## Usage
+
+The model can be used directly (without a language model) as follows:
+
+```python
+import torch
+import torchaudio
+from datasets import load_dataset
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+test_dataset = load_dataset("common_voice", "ar", split="test[:2%]")
+
+processor = Wav2Vec2Processor.from_pretrained("anas/wav2vec2-large-xlsr-arabic")
+model = Wav2Vec2ForCTC.from_pretrained("anas/wav2vec2-large-xlsr-arabic")
+
+resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+# Preprocessing the datasets.
+# We need to read the audio files as arrays
+def speech_file_to_array_fn(batch):
+    speech_array, sampling_rate = torchaudio.load(batch["path"])
+    batch["speech"] = resampler(speech_array).squeeze().numpy()
+    return batch
+
+test_dataset = test_dataset.map(speech_file_to_array_fn)
+inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+with torch.no_grad():
+    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+predicted_ids = torch.argmax(logits, dim=-1)
+
+print("Prediction:", processor.batch_decode(predicted_ids))
+print("Reference:", test_dataset["sentence"][:2])
+```
+
+
+## Evaluation
+
+The model can be evaluated as follows on the Arabic test data of Common Voice.
+
+
+```python
+import torch
+import torchaudio
+from datasets import load_dataset, load_metric
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+import re
+
+test_dataset = load_dataset("common_voice", "ar", split="test")
+
+processor = Wav2Vec2Processor.from_pretrained("anas/wav2vec2-large-xlsr-arabic")
+model = Wav2Vec2ForCTC.from_pretrained("anas/wav2vec2-large-xlsr-arabic")
+model.to("cuda")
+
+chars_to_ignore_regex = r'[,؟.!\-;\\:\'"☭«»؛—ـ_،“%‘”�]'
+
+resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+# Preprocessing the datasets.
+# Normalize the reference transcripts and read the audio files as arrays
+def speech_file_to_array_fn(batch):
+    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
+    batch["sentence"] = re.sub('[a-z]','',batch["sentence"])
+    batch["sentence"] = re.sub("[إأٱآا]", "ا", batch["sentence"])
+    noise = re.compile(""" ّ    | # Tashdid
+                         َ    | # Fatha
+                         ً    | # Tanwin Fath
+                         ُ    | # Damma
+                         ٌ    | # Tanwin Damm
+                         ِ    | # Kasra
+                         ٍ    | # Tanwin Kasr
+                         ْ    | # Sukun
+                         ـ     # Tatwil/Kashida
+                     """, re.VERBOSE)
+    batch["sentence"] = re.sub(noise, '', batch["sentence"])
+    speech_array, sampling_rate = torchaudio.load(batch["path"])
+    batch["speech"] = resampler(speech_array).squeeze().numpy()
+    return batch
+
+test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+wer = load_metric("wer")
+# Run batched inference and decode the greedy (argmax) predictions
+def evaluate(batch):
+    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+    with torch.no_grad():
+        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+
+    pred_ids = torch.argmax(logits, dim=-1)
+    batch["pred_strings"] = processor.batch_decode(pred_ids)
+    return batch
+
+result = test_dataset.map(evaluate, batched=True, batch_size=8)
+
+print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+```
+
+**Test Result**: 52.18 %
+
+
+## Training
+
+The Common Voice Corpus 4 `train` and `validation` splits were used for training.
+
+The script used for training can be found [here](...)
+
+Twitter: [here](https://twitter.com/hasnii_anas)
+
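The Test WER reported in the README above is a word error rate: the word-level edit distance between prediction and reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that computation, for illustration only. The helper names `edit_distance` and `word_error_rate` are hypothetical, and pooling errors over all sentences is an assumption; this is not the implementation behind the `wer` metric the README loads.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words (assumed pooling)."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


# One substitution and one deletion against a four-word reference.
example_wer = word_error_rate(["a b c d"], ["a x c"])
print("WER: {:.2f}".format(100 * example_wer))
```

Multiplying by 100, as the README's evaluation script does, turns the ratio into a percentage.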
.ipynb_checkpoints/preprocessor_config-checkpoint.json ADDED
@@ -0,0 +1,8 @@
+{
+  "do_normalize": true,
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": true,
+  "sampling_rate": 16000
+}
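As a rough illustration of what the settings in this preprocessor config ask for, the sketch below normalizes each utterance to zero mean and unit variance (`do_normalize`), pads on the right with `padding_value`, and builds the attention mask (`return_attention_mask`). It is a pure-Python approximation under stated assumptions: the function names and the `eps` stabilizer are hypothetical, and the real feature extractor in `transformers` may differ in detail.

```python
import math


def normalize(samples, eps=1e-7):
    """Scale one utterance to zero mean and unit variance (eps is assumed)."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(var + eps) for x in samples]


def pad_batch(utterances, padding_value=0.0):
    """Right-pad to the longest utterance and build the attention mask."""
    longest = max(len(u) for u in utterances)
    values, mask = [], []
    for u in utterances:
        pad = longest - len(u)
        values.append(list(u) + [padding_value] * pad)  # "padding_side": "right"
        mask.append([1] * len(u) + [0] * pad)           # 1 = real sample, 0 = padding
    return values, mask


# Two toy "waveforms" of different length.
batch = [normalize([0.1, 0.3, 0.5, 0.7]), normalize([0.2, 0.4])]
input_values, attention_mask = pad_batch(batch)
print(attention_mask)
```

The mask is what lets the model ignore the padded tail of the shorter utterance when a batch mixes lengths.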
.ipynb_checkpoints/special_tokens_map-checkpoint.json ADDED
@@ -0,0 +1 @@
+{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "pad_token": "[PAD]"}
.ipynb_checkpoints/tokenizer_config-checkpoint.json ADDED
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "[PAD]", "do_lower_case": false, "word_delimiter_token": "|"}
.ipynb_checkpoints/vocab-checkpoint.json ADDED
@@ -0,0 +1 @@
+{"خ": 0, "ة": 1, "د": 2, "ا": 4, "ض": 5, "م": 6, "و": 7, "ك": 8, "ث": 9, "ش": 10, "ع": 11, "ز": 12, "ء": 13, "ی": 14, "ن": 15, "ه": 16, "ق": 17, "ت": 18, "ب": 19, "ف": 20, "ظ": 21, "ح": 22, "ص": 23, "ئ": 24, "ذ": 25, "ى": 26, "غ": 27, "س": 28, "ر": 29, "ط": 30, "ي": 31, "ل": 32, "ؤ": 33, "ج": 34, "|": 3, "[UNK]": 35, "[PAD]": 36}
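To show how this vocabulary and the `word_delimiter_token` from the tokenizer config turn predicted ids into text, here is a minimal sketch of greedy CTC decoding. It assumes `[PAD]` serves as the CTC blank and uses only a handful of entries copied from the vocab above; `greedy_ctc_decode` is a hypothetical helper, not the tokenizer's own `batch_decode`.

```python
from itertools import groupby

# A small subset of the vocab above, with the same ids.
VOCAB = {"د": 2, "|": 3, "ا": 4, "ب": 19, "[UNK]": 35, "[PAD]": 36}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def greedy_ctc_decode(ids, blank="[PAD]", word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the delimiter to a space."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    tokens = [ID_TO_TOKEN[token_id] for token_id in collapsed]
    chars = [" " if t == word_delimiter else t for t in tokens if t != blank]
    return "".join(chars).strip()


# Frame-level argmax ids for a toy utterance.
decoded = greedy_ctc_decode([19, 19, 36, 4, 4, 19, 3, 3, 2, 36])
print(decoded)
```

Repeats are collapsed before blanks are removed, so a blank between two identical ids keeps a doubled letter intact.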
README.md CHANGED
@@ -23,7 +23,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 59.67
+      value: 52.18
 ---
 
 # Wav2Vec2-Large-XLSR-53-Arabic
@@ -130,7 +130,7 @@ result = test_dataset.map(evaluate, batched=True, batch_size=8)
 print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
 ```
 
-**Test Result**: 59.67 %
+**Test Result**: 52.18 %
 
 
 ## Training
config.json CHANGED
@@ -46,7 +46,7 @@
   "final_dropout": 0.0,
   "gradient_checkpointing": true,
   "hidden_act": "gelu",
-  "hidden_dropout": 0.1,
+  "hidden_dropout": 0.05,
   "hidden_size": 1024,
   "initializer_range": 0.02,
   "intermediate_size": 4096,
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:70dc8c441b2f93e47a9a02b7dd3ceae28dd595875c017c98a18f0d7d4e7d7f43
+oid sha256:c9d92c7e4e59488cb3de5cb0336893c24517ea0da99be9f9cd6f77ada2ecbe0b
 size 1262085527
vocab.json CHANGED
@@ -1 +1 @@
-{"ن": 0, "م": 1, "ش": 2, "د": 3, "ف": 4, "خ": 5, "س": 6, "ك": 7, "ض": 8, "ؤ": 9, "ط": 10, "ء": 11, "ص": 12, "ی": 13, "ل": 14, "ظ": 15, "ه": 16, "ب": 17, "غ": 18, "ح": 19, "ث": 20, "ة": 21, "ي": 22, "ت": 23, "ى": 24, "ج": 25, "ق": 26, "ر": 27, "ا": 28, "ع": 29, "ذ": 30, "ز": 31, "ئ": 32, "و": 34, "|": 33, "[UNK]": 35, "[PAD]": 36}
+{"خ": 0, "ة": 1, "د": 2, "ا": 4, "ض": 5, "م": 6, "و": 7, "ك": 8, "ث": 9, "ش": 10, "ع": 11, "ز": 12, "ء": 13, "ی": 14, "ن": 15, "ه": 16, "ق": 17, "ت": 18, "ب": 19, "ف": 20, "ظ": 21, "ح": 22, "ص": 23, "ئ": 24, "ذ": 25, "ى": 26, "غ": 27, "س": 28, "ر": 29, "ط": 30, "ي": 31, "ل": 32, "ؤ": 33, "ج": 34, "|": 3, "[UNK]": 35, "[PAD]": 36}
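The vocab change above reassigns ids to characters, and `pytorch_model.bin` changes in the same commit. A plausible reading, stated here as an assumption, is that the new weights were trained against the new mapping, so mixing the old vocab with the new weights (or the reverse) would decode to the wrong characters. The sketch below lists which ids changed meaning, using only the first few entries of each mapping; `remapped_ids` is a hypothetical helper.

```python
# First entries of the old and new vocab.json, copied from the diff above.
OLD_VOCAB = {"ن": 0, "م": 1, "ش": 2, "د": 3, "|": 33}
NEW_VOCAB = {"خ": 0, "ة": 1, "د": 2, "|": 3}


def invert(vocab):
    """Map id -> token."""
    return {token_id: token for token, token_id in vocab.items()}


def remapped_ids(old, new):
    """Return {id: (old_token, new_token)} for shared ids whose token changed."""
    old_by_id, new_by_id = invert(old), invert(new)
    shared = sorted(old_by_id.keys() & new_by_id.keys())
    return {i: (old_by_id[i], new_by_id[i])
            for i in shared if old_by_id[i] != new_by_id[i]}


changes = remapped_ids(OLD_VOCAB, NEW_VOCAB)
for token_id, (old_token, new_token) in changes.items():
    print(token_id, old_token, "->", new_token)
```

Running the same check over the two full files would show whether any id kept its character between the two commits.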