File size: 4,466 Bytes
8b4523b aab5371 876dff9 aab5371 1e2b39a aab5371 876dff9 aab5371 8b4523b aab5371 64fb33b 9603413 64fb33b 64771ba 55c34da 64771ba 64fb33b b466f87 64771ba f932e73 b96632b 64771ba b466f87 64771ba b466f87 64771ba aab5371 feb891f a585e5e 64fb33b a585e5e 177ebd7 b2d9405 177ebd7 4a36ad9 177ebd7 4a36ad9 b8eb4fa 4a36ad9 b8eb4fa 4a36ad9 b8eb4fa 88fac87 b8eb4fa 4a36ad9 b2d9405 b8eb4fa feb891f 64771ba 177ebd7 aab5371 177ebd7 828771f 884a340 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 |
---
language: fa
datasets:
- common_voice_6_1
tags:
- audio
- automatic-speech-recognition
license: mit
widget:
- example_title: Common Voice Sample 1
src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/0/audio/audio.mp3
- example_title: Common Voice Sample 2
src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/1/audio/audio.mp3
model-index:
- name: Sharif-wav2vec2
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice Corpus 6.1 (clean)
type: common_voice_6_1
config: clean
split: test
args:
language: fa
metrics:
- name: Test WER
type: wer
value: 6.0
---
# Sharif-wav2vec2
This is a fine-tuned version of Sharif Wav2vec2 for Farsi. The base model went through a fine-tuning process in which 108 hours of Commonvoice's Farsi samples with a sampling rate equal to 16kHz. Afterward, we trained a 5gram using [kenlm](https://github.com/kpu/kenlm) toolkit and used it in the processor which increased our accuracy on online ASR.
## Usage
When using the model, ensure that your speech input is sampled at 16Khz. Prior to the usage, you may need to install the below dependencies:
```shell
pip install pyctcdecode
pip install pypi-kenlm
```
For testing you can use the hosted inference API at the hugging face (There are provided examples from common-voice) it may take a while to transcribe the given voice. Or you can use the bellow code for a local run:
```python
import tensorflow
import torchaudio
import torch
import numpy as np
from transformers import AutoProcessor, AutoModelForCTC
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()
features = processor(
speech_array,
sampling_rate=processor.feature_extractor.sampling_rate,
return_tensors="pt",
padding=True)
with torch.no_grad():
logits = model(
features.input_values,
attention_mask=features.attention_mask).logits
prediction = processor.batch_decode(logits.numpy()).text
print(prediction[0])
# تست
```
## Evaluation
For the evaluation, you can use the code below. Ensure your dataset to be in following form in order to avoid any further conflict:
| path | reference|
|:----:|:--------:|
| path/to/audio_file.wav | "TRANSCRIPTION" |
also, make sure you have installed `pip install jiwer` prior to running.
```python
import tensorflow
import torchaudio
import torch
import librosa
from datasets import load_dataset,load_metric
import numpy as np
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from transformers import Wav2Vec2ProcessorWithLM
model = Wav2Vec2ForCTC.from_pretrained("SLPL/Sharif-wav2vec2")
processor = Wav2Vec2ProcessorWithLM.from_pretrained("SLPL/Sharif-wav2vec2")
def speech_file_to_array_fn(batch):
speech_array, sampling_rate = torchaudio.load(batch["path"])
speech_array = speech_array.squeeze().numpy()
speech_array = librosa.resample(
np.asarray(speech_array),
sampling_rate,
processor.feature_extractor.sampling_rate)
batch["speech"] = speech_array
return batch
def predict(batch):
features = processor(
batch["speech"],
sampling_rate=processor.feature_extractor.sampling_rate,
return_tensors="pt",
padding=True
)
with torch.no_grad():
logits = model(
features.input_values,
attention_mask=features.attention_mask).logits
batch["prediction"] = processor.batch_decode(logits.numpy()).text
return batch
dataset = load_dataset(
"csv",
data_files={"test":"dataset.eval.csv"},
delimiter=",")["test"]
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict, batched=True, batch_size=4)
wer = load_metric("wer")
print("WER: {:.2f}".format(wer.compute(
predictions=result["prediction"],
references=result["reference"])))
```
*Result (WER) on common-voice 6.1*:
| cleaned | other |
|:---:|:---:|
| 0.06 | 0.16 |
## Citation
If you want to cite this model you can use this:
```bibtex
?
```
### Contributions
Thanks to [@sadrasabouri](https://github.com/sadrasabouri) and [@sarasadeghii](https://github.com/Sarasadeghii) for adding this dataset.
|