---
language:
- km
license: apache-2.0
tags:
- automatic-speech-recognition
- openslr
- robust-speech-event
- km
- generated_from_trainer
- hf-asr-leaderboard
model-index:
- name: xls-r-300m-km
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR km
      type: openslr
      args: km
    metrics:
    - name: Test WER
      type: wer
      value: 25.7
    - name: Test CER
      type: cer
      value: 7.03
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: km
    metrics:
    - name: Test WER
      type: wer
      value: 25.7
    - name: Test CER
      type: cer
      value: 7.03
---
# xls-r-300m-km
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the OpenSLR dataset.
It achieves the following results on the evaluation set:
- Loss: 0.3281
- Wer: 0.3462
## Evaluation results on OpenSLR "test" (self-split 10%), running ./eval.py
- WER: 0.3216977389924633
- CER: 0.08653361193169537
## Evaluation results with language model on OpenSLR "test" (self-split 10%), running ./eval.py
- WER: 0.257040856802856
- CER: 0.07025001801282513
## Installation
Install the following libraries on top of Hugging Face Transformers to enable decoding with the language model.
```bash
pip install pyctcdecode
pip install https://github.com/kpu/kenlm/archive/master.zip
```
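These packages are only needed to decode with the n-gram language model (the "with language model" results above); the greedy CTC decoding shown in Approach 2 below works with Transformers alone.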
## Usage
**Approach 1:** Use Hugging Face's `pipeline`, which covers everything end to end, from raw audio input to text output.
```python
from transformers import pipeline
# Load the model
pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
# Process raw audio
output = pipe("sound_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
```
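The `chunk_length_s` and `stride_length_s` arguments enable chunked inference: a long recording is split into 10-second chunks with overlapping context on either side, so files longer than a single model window can still be transcribed in one call.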
**Approach 2:** A more manual approach, running the processor and model directly and decoding the predictions yourself.
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch
# load model and processor
processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
# Read and process the input
speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
print(predicted_sentences)
```
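If the language-model packages from the Installation section are installed, decoding can also be boosted with the n-gram LM. The following is a minimal sketch, assuming this repository ships the pyctcdecode/kenlm decoder files used for the "with language model" results; if it does not, `Wav2Vec2ProcessorWithLM.from_pretrained` will fail and the plain processor above should be used instead.
```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# Assumption: the repo contains a pyctcdecode/kenlm decoder alongside the model files
processor = Wav2Vec2ProcessorWithLM.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")

# Read and resample the input to 16 kHz
speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# batch_decode on the LM processor runs beam search over the raw logits (as numpy)
transcription = processor.batch_decode(logits.numpy()).text
print(transcription)
```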
## Intended uses & limitations
The data used for this model is only around 4 hours of recordings.
- We split it 80/10/10, so the training set is roughly 3.2 hours, which is very small.
- Even so, its performance is not bad at all for such a small dataset. You can try it out.
- Its limitations are:
  - Rare characters, e.g. ឬស្សី ឪឡឹក
  - Speech needs to be clear and articulate.
- More data covering a wider vocabulary and character set may help improve this system.
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 100
- mixed_precision_training: Native AMP
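The training script itself is not included in this card; the following is only a hedged sketch of how the hyperparameters above would map onto `transformers.TrainingArguments`. The output path and evaluation cadence are assumptions, the cadence inferred from the 400-step intervals in the results table below.
```python
from transformers import TrainingArguments

# Hypothetical output_dir; eval/logging cadence inferred from the results table.
training_args = TrainingArguments(
    output_dir="./xls-r-300m-km",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # 8 * 4 = 32 total train batch size
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=100,
    fp16=True,                      # native AMP mixed-precision training
    evaluation_strategy="steps",
    eval_steps=400,
    logging_steps=400,
)
```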
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 5.0795 | 5.47 | 400 | 4.4121 | 1.0 |
| 3.5658 | 10.95 | 800 | 3.5203 | 1.0 |
| 3.3689 | 16.43 | 1200 | 2.8984 | 0.9996 |
| 2.01 | 21.91 | 1600 | 1.0041 | 0.7288 |
| 1.6783 | 27.39 | 2000 | 0.6941 | 0.5989 |
| 1.527 | 32.87 | 2400 | 0.5599 | 0.5282 |
| 1.4278 | 38.35 | 2800 | 0.4827 | 0.4806 |
| 1.3458 | 43.83 | 3200 | 0.4429 | 0.4532 |
| 1.2893 | 49.31 | 3600 | 0.4156 | 0.4330 |
| 1.2441 | 54.79 | 4000 | 0.4020 | 0.4040 |
| 1.188 | 60.27 | 4400 | 0.3777 | 0.3866 |
| 1.1628 | 65.75 | 4800 | 0.3607 | 0.3858 |
| 1.1324 | 71.23 | 5200 | 0.3534 | 0.3604 |
| 1.0969 | 76.71 | 5600 | 0.3428 | 0.3624 |
| 1.0897 | 82.19 | 6000 | 0.3387 | 0.3567 |
| 1.0625 | 87.66 | 6400 | 0.3339 | 0.3499 |
| 1.0601 | 93.15 | 6800 | 0.3288 | 0.3446 |
| 1.0474 | 98.62 | 7200 | 0.3281 | 0.3462 |
### Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.2.dev0
- Tokenizers 0.11.0