sxdxfan committed
Commit 24e33e9 · 1 Parent(s): e23fc14

Add article link

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -48,7 +48,7 @@ For more details, see the [**GitHub Repository**](https://github.com/voicekit-te
 
 **Word Error Rate ([WER](https://huggingface.co/spaces/evaluate-metric/wer))** is used to evaluate the quality of automatic speech recognition systems, which can be interpreted as the percentage of incorrectly recognized words compared to a reference transcript. A lower value indicates higher accuracy. T-one demonstrates state-of-the-art performance, especially on its target domain of telephony, while remaining competitive on general-purpose benchmarks.
 
-| Category | T-one (70M) | GigaAM-RNNT v2 (243M) | GigaAM-CTC v2 (242M) | Vosk-model-ru 0.54 (65M) | Vosk-model-small-streaming-ru 0.54 (20M) | Whisper large-v3 (1540M) |
+| Category | T-one (71M) | GigaAM-RNNT v2 (243M) | GigaAM-CTC v2 (242M) | Vosk-model-ru 0.54 (65M) | Vosk-model-small-streaming-ru 0.54 (20M) | Whisper large-v3 (1540M) |
 |:--|:--|:--|:--|--:|:--|:--|
 | Call-center | **8.63** | 10.22 | 10.57 | 11.28 | 15.53 | 19.39 |
 | Other telephony | **6.20** | 7.88 | 8.15 | 8.69 | 13.49 | 17.29 |
@@ -118,7 +118,7 @@ For a complete guide please refer to the [**fine-tuning example notebook**](http
 ## 🎙 Acoustic model
 
 ### Architecture
-T-one is a 70M parameter acoustic model based on the **Conformer** architecture, with several key innovations to improve performance and efficiency:
+T-one is a 71M parameter acoustic model based on the **Conformer** architecture, with several key innovations to improve performance and efficiency:
 - **SwiGLU Activation:** The feed-forward module is replaced with a SwiGLU module for better performance.
 - **Modern Normalization:** SiLU (Swish) activations and RMSNorm are used in place of ReLU and LayerNorm.
 - **RoPE Embeddings:** Relative positional embeddings from Transformer-XL are replaced with faster Rotary Position Embeddings (RoPE).
@@ -134,7 +134,7 @@ The model supports streaming inference, which means it can process long audio fi
 The primary use case for this model is streaming speech recognition of calls. The user sends small audio chunks to the model, and it processes each segment incrementally, returning the finalized text and word-level timestamps in real time.
 T-one can be easily fine-tuned for specific domains.
 
-For a detailed exploration of our architecture, design choices, and implementation, check out our accompanying article (link will be shared shortly). Also refer to our **technical deep dive** on how to improve quality and training speed of a streaming ASR model on [**YouTube**](https://www.youtube.com/watch?v=OQD9o1MdFRE).
+For a detailed exploration of our architecture, design choices, and implementation, check out our accompanying [**article**](https://habr.com/ru/companies/tbank/articles/929850). Also refer to our **technical deep dive** on how to improve quality and training speed of a streaming ASR model on [**YouTube**](https://www.youtube.com/watch?v=OQD9o1MdFRE).
 
 ## 📉 Training details
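The WER metric cited in the README can be illustrated with a small word-level edit-distance sketch. This is a minimal standalone illustration, not the `evaluate` library's implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four reference words -> WER = 0.25 (i.e. 25%).
print(wer("please hold the line", "please hold a line"))  # 0.25
```

A table value such as 8.63 therefore means roughly 8.63 recognition errors per 100 reference words.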
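As a rough illustration of the SwiGLU feed-forward variant mentioned in the architecture list, here is a generic NumPy sketch. The function name, shapes, and weight layout are invented for illustration and are not T-one's actual implementation:

```python
import numpy as np

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: a SiLU-gated hidden state replaces the plain MLP.
    x: (d_model,), w_gate/w_up: (d_model, d_ff), w_down: (d_ff, d_model)."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
y = swiglu_ffn(rng.normal(size=d_model),
               rng.normal(size=(d_model, d_ff)),
               rng.normal(size=(d_model, d_ff)),
               rng.normal(size=(d_ff, d_model)))
print(y.shape)  # (8,)
```

The gating is the key design point: instead of one projection pushed through ReLU, two projections are combined multiplicatively, which tends to improve quality at similar parameter cost.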
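The streaming flow described in the README (the client sends small audio chunks, the model decodes each one incrementally) can be sketched as a generic control loop. `recognize_chunk` below is a hypothetical stand-in for a real stateful ASR decoder, used only to show the chunking pattern:

```python
from typing import Callable, Iterator, List

def stream_chunks(samples: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Split a long audio signal into fixed-size chunks for incremental decoding."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

def recognize_stream(samples: List[float], chunk_size: int,
                     recognize_chunk: Callable[[List[float]], List[str]]) -> str:
    """Feed chunks to a (hypothetical) recognizer and accumulate finalized words."""
    transcript: List[str] = []
    for chunk in stream_chunks(samples, chunk_size):
        transcript.extend(recognize_chunk(chunk))
    return " ".join(transcript)

# Toy recognizer: emits one fake word per chunk, just to show the control flow.
fake = lambda chunk: [f"w{len(chunk)}"]
print(recognize_stream([0.0] * 10, 4, fake))  # w4 w4 w2
```

Because each chunk is processed as it arrives, memory use stays bounded regardless of the total audio length, which is what makes the approach suitable for hour-long calls.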