transiteration committed on
Commit a9ba0ae · 1 Parent(s): 04bec1f

Update README.md

Files changed (1)
  1. README.md +28 -5
README.md CHANGED
@@ -16,8 +16,8 @@ tags:
16
 
17
  ## Model Overview
18
 
19
- In order to prepare, adjust, or experiment with the model, it's necessary to install NVIDIA NeMo.
20
- We advise installing it once you've already installed the most recent version of Pytorch.
21
  ```
22
  pip install nemo_toolkit['all']
23
  ```
@@ -47,13 +47,36 @@ python3 transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo dataset_manife
47
 
48
  ## Input and Output
49
 
50
- This model can take input in the form of mono-channel audio .WAV files with a sample rate of
51
- 16,000 KHz. Then, this model gives you the spoken words in a text format for a given audio sample.
52
 
53
  ## Model Architecture
54
 
55
- QuartzNet [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It has comparable accuracy to Jasper while having much fewer parameters. This particular model has 15 blocks each repeated 5 times.
56
 
 
16
 
17
  ## Model Overview
18
 
19
+ To train, fine-tune, or experiment with the model, you need to install the NVIDIA NeMo Toolkit [1].
20
+ We recommend installing it after you have installed the latest version of PyTorch.
21
  ```
22
  pip install nemo_toolkit['all']
23
  ```
 
47
 
48
  ## Input and Output
49
 
50
+ The model accepts mono-channel audio (.wav files) sampled at 16 kHz (16,000 Hz) as input.
51
+ For a given audio sample, it outputs the spoken words as text.
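The input requirement above (mono channel, 16 kHz sample rate) can be checked before transcription. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of NeMo or of this repository.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Report whether a WAV file matches the expected channel count and sample rate.

    Hypothetical helper for illustration; it only reads the WAV header.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    ok = channels == expected_channels and rate == expected_rate
    return ok, {"channels": channels, "sample_rate": rate}
```

Calling `check_wav_format("sample.wav")` (the file name is a placeholder) returns a boolean plus the detected values, so files that need resampling or downmixing can be identified before they are added to a manifest.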
52
 
53
  ## Model Architecture
54
 
55
+ QuartzNet 15x5 [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. Its accuracy is comparable to Jasper's while it has far fewer parameters. This particular model has 15 blocks, each repeated 5 times.
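To illustrate why separable convolutions reduce the parameter count, the sketch below compares the weights of a standard 1D convolution with those of a depthwise convolution followed by a 1x1 pointwise convolution. The channel count and kernel size are hypothetical values chosen only for illustration; they are not taken from this model's configuration.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weight count of a standard 1D convolution (bias ignored)."""
    return in_channels * out_channels * kernel_size


def separable_conv1d_params(in_channels, out_channels, kernel_size):
    """Weight count of a depthwise conv plus a 1x1 pointwise conv (bias ignored)."""
    depthwise = in_channels * kernel_size      # one filter per input channel
    pointwise = in_channels * out_channels     # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
channels, kernel = 256, 33
standard = conv1d_params(channels, channels, kernel)
separable = separable_conv1d_params(channels, channels, kernel)
print(f"standard: {standard}, separable: {separable}")
```

With these assumed sizes the separable layer needs roughly 29 times fewer weights, which is the kind of saving that lets a network afford larger filter sizes.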
56
 
57
+ ## Training
58
 
59
+ The model was fine-tuned on Kazakh speech for several epochs, starting from the pre-trained English model.
60
 
61
+ ## Dataset
62
 
63
+ Kazakh Speech Corpus 2 (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.
64
+ In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.
65
+
66
+ ## Performance
67
+
68
+ Average WER: 15.53%
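For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below shows that standard definition for a single pair of strings. The README does not state how the average was computed, so this is not necessarily the evaluation script behind the figure above, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

A corpus-level WER is usually obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.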
69
+
70
+ ## Limitations
71
+
72
+ Because the GPU used (NVIDIA GeForce RTX 2070) has limited computing capacity, we chose a lightweight model architecture for fine-tuning.
73
+ A lightweight model is generally faster at inference but may be less accurate overall.
74
+ In addition, the model may perform worse on speech that contains technical terms or dialect words it has not learned.
75
+
76
+ ## References
77
+
78
+ [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
79
+
80
+ [2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5)
81
+
82
+ [3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1)