anteju committed on
Commit c661777 · 1 Parent(s): 83b23d7

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -90,14 +90,14 @@ The tokenizers for these models were built using the text transcripts of the tra
90
 
91
  The vocabulary we use contains 27 characters:
92
  ```python
93
- ['a', 'b', 'c', 'č', 'ć', 'd', 'đ', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 'š', 't', 'u', 'v', 'z', 'ž']
94
  ```
95
 
96
- Full config can be found inside the .nemo files.
97
 
98
  ### Datasets
99
 
100
- All the models in this collection are trained on ParlaSpeech-HR v1.0 Croatian dataset, which contains around 1665 hours of training data after data cleaning, 2.2 hours of developement and 2.3 hours of test data.
101
 
102
  ## Performance
103
 
@@ -105,13 +105,13 @@ The list of the available models in this collection is shown in the following ta
105
 
106
  | Version | Tokenizer | Vocabulary Size | Dev WER | Test WER | Train Dataset |
107
  |---------|-----------------------|-----------------|---------|----------|---------------------|
108
- | 1.11.0 | SentencePiece Unigram | 128 | X.YZ | X.YZ | ParlaSpeech-HR v1.0 |
109
 
110
  You may use language models (LMs) and beam search to improve the accuracy of the models.
111
 
112
  ## Limitations
113
 
114
- Since the model is trained just on ParlaSpeech-HR v1.0 dataset, the performance of this model might degrade for speech which includes terms, or vernecular that the model has not been trained on. The model might also perform worse for accented speech.
115
 
116
  ## Deployment with NVIDIA Riva
117
 
 
90
 
91
  The vocabulary we use contains 27 characters:
92
  ```python
93
+ [' ', 'a', 'b', 'c', 'č', 'ć', 'd', 'đ', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 'š', 't', 'u', 'v', 'z', 'ž']
94
  ```
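The updated list adds the space character, so it now has one more entry than the letter count quoted in the sentence above it. A minimal sketch for checking this, assuming the list is copied verbatim from the `+` line of this diff (the names `vocabulary` and `letters` are illustrative and not taken from the model config):

```python
# Vocabulary copied verbatim from the updated README line in this diff.
vocabulary = [' ', 'a', 'b', 'c', 'č', 'ć', 'd', 'đ', 'e', 'f', 'g', 'h',
              'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 'š', 't',
              'u', 'v', 'z', 'ž']

# Every entry should be unique.
assert len(set(vocabulary)) == len(vocabulary)

# Separate the letters from the whitespace entry.
letters = [ch for ch in vocabulary if not ch.isspace()]

print(len(vocabulary), len(letters))  # prints: 28 27
```

Read this way, the figure of 27 seems to count the letters only, with the space as a 28th entry.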
95
 
96
+ Full config can be found inside the `.nemo` files.
97
 
98
  ### Datasets
99
 
100
+ All the models in this collection are trained on ParlaSpeech-HR v1.0 Croatian dataset, which contains around 1665 hours of training data after data cleaning, 2.2 hours of development and 2.3 hours of test data.
101
 
102
  ## Performance
103
 
 
105
 
106
  | Version | Tokenizer | Vocabulary Size | Dev WER | Test WER | Train Dataset |
107
  |---------|-----------------------|-----------------|---------|----------|---------------------|
108
+ | 1.11.0 | SentencePiece Unigram | 128 | 4.43 | 4.70 | ParlaSpeech-HR v1.0 |
109
 
110
  You may use language models (LMs) and beam search to improve the accuracy of the models.
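For readers unfamiliar with the metric in the Dev WER and Test WER columns, the following is a minimal sketch of word error rate, assuming the standard definition (word-level edit distance divided by the number of reference words). It is not the evaluation script used to produce the numbers in the table; the function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of three reference words is dropped.
wer = word_error_rate("dobar dan svima", "dobar svima")
print(f"{100 * wer:.2f}")  # WER as a percentage
```

If the table values are percentages, as is common for ASR results, a Test WER of 4.70 would correspond to roughly 4.7 word errors per 100 reference words.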
111
 
112
  ## Limitations
113
 
114
+ Since the model is trained just on ParlaSpeech-HR v1.0 dataset, the performance of this model might degrade for speech which includes terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
115
 
116
  ## Deployment with NVIDIA Riva
117