---
license: cc-by-nc-4.0
datasets:
- taras-sereda/uk-pods
language:
- uk
library_name: nemo
---

## Usage

The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or experiment with the model, you need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest version of PyTorch.

```
pip install "nemo_toolkit[all]"
```

### Automatically instantiate the model

```python
from nemo.collections.asr.models import EncDecCTCModelBPE
asr_model = EncDecCTCModelBPE.from_pretrained("taras-sereda/uk-pods-conformer")
```

### Transcribing using Python
First, download a sample recording:
```
wget "https://huggingface.co/datasets/taras-sereda/uk-pods/resolve/main/example/e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav?download=true" -O e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav
```
Then transcribe it:
```python
asr_model.transcribe(['e934c3e4-c37b-4607-98a8-22cdff933e4a_0266.wav'])
```

### Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
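
A quick way to check whether a recording already matches this format is to read its WAV header. The sketch below uses only Python's standard `wave` module, which handles PCM WAV files; the helper names `describe_wav` and `is_model_ready` are hypothetical and not part of NeMo.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated in the Input section
EXPECTED_CHANNELS = 1         # mono


def describe_wav(path):
    """Return (sample_rate_hz, n_channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def is_model_ready(path):
    """Return True if the file is 16 kHz mono (hypothetical helper, not a NeMo API)."""
    sample_rate, channels = describe_wav(path)
    return sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before transcription.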

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

The Conformer-CTC model is a non-autoregressive variant of the Conformer model [2] for automatic speech recognition that uses CTC loss and decoding instead of a Transducer. More details on this model are available here: [Conformer-CTC Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-ctc).
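
To illustrate what greedy CTC decoding does with the per-frame predictions, here is a minimal sketch of the standard collapse rule: merge runs of identical consecutive IDs, then drop the blank symbol. It is not NeMo's implementation; the function name `ctc_greedy_collapse` and the choice of `0` as the blank ID are assumptions made for illustration only.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Collapse per-frame argmax IDs into a token sequence using the CTC rule.

    1. Merge runs of identical consecutive IDs into a single ID.
    2. Remove every occurrence of the blank ID.
    """
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Toy input: 0 plays the role of the blank here (an assumption for illustration).
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0]
print(ctc_greedy_collapse(frames, blank_id=0))  # [3, 3, 5]
```

Note that the blank between the two runs of `3` is what keeps them as two separate tokens; without it they would be merged into one.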



### Datasets

This model was trained on a combination of two datasets:

- UK-PODS [3] train set: 46 hours of conversational speech collected from Ukrainian podcasts.
- Mozilla Common Voice Corpus 10.0, validated split (excluding dev and test data): 50.1 hours of Ukrainian speech.

## Performance

Performance of the ASR model is reported as Word Error Rate (WER), obtained with greedy decoding. Values are fractions, so 0.093 corresponds to 9.3% WER.
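
As a reminder of what the metric measures, the sketch below computes WER for a single reference/hypothesis pair as the word-level edit distance divided by the number of reference words. It is an illustrative implementation, not the evaluation script used for this model; the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substituted word out of four reference words.
print(word_error_rate("я люблю слухати подкасти", "я люблю слухати подкаст"))  # 0.25
```

For a whole test set, the error counts are usually summed over all utterances and divided by the total number of reference words, rather than averaging per-utterance scores.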

| Tokenizer     | Vocabulary Size  | UK-PODS test | MCV-10 test |
|:-------------:| :--------------: | :----------: | :---------: |
| SentencePiece | 1024             | 0.093        | 0.116       |

## References

- [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

- [2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

- [3] [UK-PODS](https://huggingface.co/datasets/taras-sereda/uk-pods)