aismlv committed on
Commit
b6df156
1 Parent(s): 6a2391e

Create README.md

Files changed (1)
  1. README.md +123 -0
README.md ADDED
@@ -0,0 +1,123 @@
+ language: kk
+ datasets:
+ - kazakh_speech_corpus
+ metrics:
+ - wer
+ tags:
+ - audio
+ - automatic-speech-recognition
+ - speech
+ - xlsr-fine-tuning-week
+ license: apache-2.0
+ model-index:
+ - name: Wav2Vec2-XLSR-53 Kazakh by adilism
+   results:
+   - task:
+       name: Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Kazakh Speech Corpus v1.1
+       type: kazakh_speech_corpus
+       args: kz
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 22.84
+ ---
+
+ # Wav2Vec2-Large-XLSR-53-Kazakh
+
+ Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Kazakh using the [Kazakh Speech Corpus v1.1](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1).
+
+ When using this model, make sure that your speech input is sampled at 16 kHz.
+
+ ## Usage
+
+ The model can be used directly (without a language model) as follows:
+
+ ```python
+ import torch
+ import torchaudio
+ from datasets import load_dataset
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ from utils import get_test_dataset  # local helper module; not part of this commit
+
+ test_dataset = get_test_dataset("ISSAI_KSC_335RS_v1.1")
+
+ processor = Wav2Vec2Processor.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")
+ model = Wav2Vec2ForCTC.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")
+
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays and resample them to 16 kHz
+ def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+ inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ predicted_ids = torch.argmax(logits, dim=-1)
+
+ print("Prediction:", processor.batch_decode(predicted_ids))
+ print("Reference:", test_dataset["sentence"][:2])
+ ```
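Reviewer note: the `torch.argmax` followed by `processor.batch_decode` in the usage snippet amounts to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates only that collapse step in plain Python. The toy vocabulary and the blank id `0` are assumptions made for illustration; the real processor uses the model's own vocabulary and padding token, and also maps its word delimiter to spaces, which is left out here.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
VOCAB = {0: "", 1: "с", 2: "ә", 3: "л", 4: "е", 5: "м"}
BLANK_ID = 0


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame argmax ids into a string (greedy CTC decoding)."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks; they only exist to separate genuinely repeated characters.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# Made-up per-frame argmax output for one utterance.
frames = [1, 1, 0, 2, 2, 2, 3, 0, 4, 4, 5, 5]
print(greedy_ctc_decode(frames))
```

A blank between two identical ids keeps both characters (`[3, 0, 3]` decodes to two letters), while an uninterrupted run collapses to one.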
+
+
+ ## Evaluation
+
+ The model can be evaluated as follows on the test data of [Kazakh Speech Corpus v1.1](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1). To evaluate, download the [archive](https://www.openslr.org/resources/102/ISSAI_KSC_335RS_v1.1_flac.tar.gz), untar it, and pass the path to the extracted data to `get_test_dataset` as below:
+
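Reviewer note: `get_test_dataset` is imported from a local `utils` module that this commit does not include, so the snippets cannot run as posted. The sketch below is a hypothetical stand-in, not the author's helper. It assumes that every `.flac` file under the data directory has a transcript in a `.txt` file with the same stem somewhere under that directory; the function name and this layout are assumptions. It returns the two fields the README code reads, `path` and `sentence`, and it does not restrict the result to the test split, because the split metadata is not described here.

```python
from pathlib import Path


def collect_audio_transcript_pairs(data_dir):
    """Hypothetical helper: pair each .flac file with a same-stem .txt transcript."""
    root = Path(data_dir)
    # Index transcripts by file stem (assumed to match the audio file stem).
    transcripts = {p.stem: p for p in root.rglob("*.txt")}
    rows = []
    for audio_path in sorted(root.rglob("*.flac")):
        transcript_path = transcripts.get(audio_path.stem)
        if transcript_path is None:
            continue  # skip audio that has no transcript
        sentence = transcript_path.read_text(encoding="utf-8").strip()
        rows.append({"path": str(audio_path), "sentence": sentence})
    return rows
```

The result is a plain list of dictionaries. To use `.map` as the README does, the rows would still need to be wrapped in a `datasets.Dataset`.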
+ ```python
+ import torch
+ import torchaudio
+ from datasets import load_dataset, load_metric
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+ import re
+
+ from utils import get_test_dataset  # local helper module; not part of this commit
+
+ test_dataset = get_test_dataset("ISSAI_KSC_335RS_v1.1")
+ wer = load_metric("wer")
+
+ processor = Wav2Vec2Processor.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")
+ model = Wav2Vec2ForCTC.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")
+ model.to("cuda")
+ # Assumption: undefined in the original; a typical punctuation set, adjust to match what was stripped in training.
+ chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'
+ # Preprocessing the datasets.
+ # We need to normalize the transcripts and read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+ def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
+
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
+
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+ ```
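Reviewer note: the evaluation script reports word error rate (WER), which is generally computed as the word-level edit distance (substitutions, insertions and deletions) summed over all utterances and divided by the total number of reference words. The sketch below is a plain-Python illustration of that definition, not the implementation behind `load_metric("wer")`. The function name is hypothetical, and the sketch assumes whitespace tokenization and at least one reference word.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping one row of the table at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


references = ["мен үйге бардым", "бүгін ауа райы жақсы"]
predictions = ["мен үйге бардым", "бүгін ауа жақсы"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))
```

In this toy example one of the seven reference words is dropped, so the rate is 1/7. Because the rate is normalized by the reference length, it can exceed 100 % when the prediction contains many inserted words.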
+
+ **Test Result**: 22.84 %
+
+
+ ## Training
+
+ The Kazakh Speech Corpus v1.1 `train` dataset was used for training.