---
license: apache-2.0
language:
- th
pipeline_tag: automatic-speech-recognition
---

# Whisper-base Thai finetuned

## 1) Environment Setup
```bash
# visit https://pytorch.org/get-started/locally/ to install PyTorch first
pip3 install transformers librosa
```

## 2) Usage
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa

device = "cuda"  # "cpu" or "cuda"

model = WhisperForConditionalGeneration.from_pretrained("juierror/whisper-tiny-thai").to(device)
processor = WhisperProcessor.from_pretrained("juierror/whisper-tiny-thai", language="Thai", task="transcribe")

path = "/path/to/audio/file"

def inference(path: str) -> str:
    """
    Get the transcription from an audio file path.

    Args:
        path (str): path to an audio file (any format librosa can load)

    Returns:
        str: transcription
    """
    # Whisper expects 16 kHz mono audio; librosa resamples on load
    audio, sr = librosa.load(path, sr=16000)
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    generated_tokens = model.generate(
        input_features=input_features.to(device),
        max_new_tokens=255,
        language="Thai",
    ).cpu()
    transcriptions = processor.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    return transcriptions[0]

print(inference(path=path))
```

## 3) Evaluation Results
This model has been trained and evaluated on three datasets:

- Common Voice 13
- [Gowajee Corpus](https://github.com/ekapolc/gowajee_corpus)
```
@techreport{gowajee,
    title = {{Gowajee Corpus}},
    author = {Ekapol Chuangsuwanich and Atiwong Suchato and Korrawe Karunratanakul and Burin Naowarat and Chompakorn CChaichot
              and Penpicha Sangsa-nga and Thunyathon Anutarases and Nitchakran Chaipojjana},
    year = {2020},
    institution = {Chulalongkorn University, Faculty of Engineering, Computer Engineering Department},
    month = {12},
    Date-Added = {2021-07-20},
    url = {https://github.com/ekapolc/gowajee_corpus},
    note = {Version 0.9.2}
}
```
- [Thai Elderly Speech](https://github.com/VISAI-DATAWOW/Thai-Elderly-Speech-dataset/releases/tag/v1.0.0)

The Common Voice data was cleaned and split into training, development, and test sets, with care taken to ensure that no sentence appears in more than one set. The Gowajee dataset comes pre-split into training, development, and test sets, so those splits were used directly. The Thai Elderly Speech dataset was split randomly.
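A duplicate-free split like the one described for Common Voice can be sketched as follows; the `sentence` field name, the ratios, and the seed are illustrative assumptions, not the exact script used for this model.

```python
import random

def split_unique_sentences(rows, train=0.8, dev=0.1, seed=42):
    # Group utterances by sentence text so that identical sentences
    # never land in more than one split.
    by_sentence = {}
    for row in rows:
        by_sentence.setdefault(row["sentence"], []).append(row)

    sentences = sorted(by_sentence)
    random.Random(seed).shuffle(sentences)

    n_train = int(len(sentences) * train)
    n_dev = int(len(sentences) * dev)

    pick = lambda keys: [r for k in keys for r in by_sentence[k]]
    return (pick(sentences[:n_train]),
            pick(sentences[n_train:n_train + n_dev]),
            pick(sentences[n_train + n_dev:]))
```

Splitting on unique sentence text (rather than on utterances) is what prevents the same sentence, read by different speakers, from leaking across sets.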
The Character Error Rate (CER) is computed after removing spaces from both the labels and the predicted text. The Word Error Rate (WER) is computed after tokenizing both the labels and the predicted text with the PyThaiNLP newmm tokenizer.
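The metric computation described above can be sketched with a plain Levenshtein distance. The newmm tokenization step is assumed to have already produced token lists (e.g. via `pythainlp.tokenize.word_tokenize(text, engine="newmm")`), so the sketch itself is dependency-free.

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein distance via dynamic programming;
    # works on strings (characters) and on lists (tokens) alike.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[len(hyp)]

def char_error_rate(label: str, pred: str) -> float:
    # Spaces are removed from both strings before computing CER,
    # as described above.
    label, pred = label.replace(" ", ""), pred.replace(" ", "")
    return edit_distance(label, pred) / len(label)

def word_error_rate(label_tokens, pred_tokens) -> float:
    # Token lists are assumed to come from the PyThaiNLP newmm tokenizer.
    return edit_distance(label_tokens, pred_tokens) / len(label_tokens)
```

In practice a library such as `jiwer` computes the same edit-distance-based rates; the point here is only the preprocessing (space removal for CER, newmm tokenization for WER).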

These are the results:

| Dataset                           | WER   | CER   |
|-----------------------------------|-------|-------|
| Common Voice 13                   | 26.48 | 7.83  |
| Gowajee                           | 25.39 | 11.67 |
| Thai Elderly Speech (Smart Home)  | 14.85 | 4.47  |
| Thai Elderly Speech (Health Care) | 15.23 | 4.05  |