Add article link
README.md
CHANGED
@@ -48,7 +48,7 @@ For more details, see the [**GitHub Repository**](https://github.com/voicekit-te
 
 **Word Error Rate ([WER](https://huggingface.co/spaces/evaluate-metric/wer))** is used to evaluate the quality of automatic speech recognition systems, which can be interpreted as the percentage of incorrectly recognized words compared to a reference transcript. A lower value indicates higher accuracy. T-one demonstrates state-of-the-art performance, especially on its target domain of telephony, while remaining competitive on general-purpose benchmarks.
 
-| Category | T-one (
+| Category | T-one (71M) | GigaAM-RNNT v2 (243M) | GigaAM-CTC v2 (242M) | Vosk-model-ru 0.54 (65M) | Vosk-model-small-streaming-ru 0.54 (20M) | Whisper large-v3 (1540M) |
 |:--|:--|:--|:--|--:|:--|:--|
 | Call-center | **8.63** | 10.22 | 10.57 | 11.28 | 15.53 | 19.39 |
 | Other telephony | **6.20** | 7.88 | 8.15 | 8.69 | 13.49 | 17.29 |
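As background on the metric in the table above: word-level WER reduces to a Levenshtein edit distance between the hypothesis and reference word sequences, normalized by the reference length. A minimal pure-Python sketch (illustrative only, not part of the T-one codebase or the linked `evaluate` metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via a single-row dynamic program.
    d = list(range(len(hyp) + 1))          # d[j] = distance(ref[:0], hyp[:j])
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i               # prev holds the old d[j-1]
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # deletion of a reference word
                      d[j - 1] + 1,        # insertion of a hypothesis word
                      prev + (r != h))     # substitution (free if words match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

For example, one wrong word out of two (`wer("hello world", "hello word")`) yields 0.5, matching the "percentage of incorrectly recognized words" reading.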
@@ -118,7 +118,7 @@ For a complete guide please refer to the [**fine-tuning example notebook**](http
 ## Acoustic model
 
 ### Architecture
-T-one is a
+T-one is a 71M parameter acoustic model based on the **Conformer** architecture, with several key innovations to improve performance and efficiency:
 - **SwiGLU Activation:** The feed-forward module is replaced with a SwiGLU module for better performance.
 - **Modern Normalization:** SiLU (Swish) activations and RMSNorm are used in place of ReLU and LayerNorm.
 - **RoPE Embeddings:** Relative positional embeddings from Transformer-XL are replaced with faster Rotary Position Embeddings (RoPE).
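To make the SwiGLU bullet concrete: a SwiGLU feed-forward block computes `silu(x @ W_gate) * (x @ W_up)` and projects the gated result back down. A generic NumPy sketch (weight names and shapes are illustrative, not taken from the T-one implementation):

```python
import numpy as np

def silu(x):
    # SiLU (Swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # The silu-activated gate branch modulates the up-projection elementwise,
    # then the result is projected back to the model dimension.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Shapes: x is (batch, d_model); w_gate and w_up are (d_model, d_ff);
# w_down is (d_ff, d_model), so the output matches x's shape.
```

Compared with a plain two-layer ReLU feed-forward block, the extra gate adds one matrix multiply but tends to improve quality at equal parameter budgets, which is why it is a common Conformer upgrade.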
@@ -134,7 +134,7 @@ The model supports streaming inference, which means it can process long audio fi
 The primary use case for this model is streaming speech recognition of calls. The user sends small audio chunks to the model, and it processes each segment incrementally, returning the finalized text and word-level timestamps in real time.
 T-one can be easily fine-tuned for specific domains.
 
-For a detailed exploration of our architecture, design choices, and implementation, check out our accompanying article
+For a detailed exploration of our architecture, design choices, and implementation, check out our accompanying [**article**](https://habr.com/ru/companies/tbank/articles/929850). Also refer to our **technical deep dive** on how to improve quality and training speed of a streaming ASR model on [**YouTube**](https://www.youtube.com/watch?v=OQD9o1MdFRE).
 
 ## Training details
 
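The chunked streaming flow described in this hunk can be sketched as a plain feeding loop. Everything below is an assumption for illustration — the `recognize_chunk` callback, the 300 ms chunk size, and the 8 kHz/16-bit telephony format are hypothetical, not the actual T-one API:

```python
CHUNK_SECONDS = 0.3      # assumed chunk duration, not T-one's real value
SAMPLE_RATE = 8000       # telephony-rate PCM (assumption)
BYTES_PER_SAMPLE = 2     # 16-bit samples (assumption)

def stream_audio(audio: bytes, recognize_chunk):
    """Feed fixed-size PCM chunks to a recognizer callback and collect the
    finalized (word, timestamp) results it emits incrementally."""
    chunk_bytes = int(CHUNK_SECONDS * SAMPLE_RATE) * BYTES_PER_SAMPLE
    results = []
    for start in range(0, len(audio), chunk_bytes):
        results.extend(recognize_chunk(audio[start:start + chunk_bytes]))
    return results
```

In a real deployment the chunks would arrive from a network stream rather than a pre-loaded buffer, but the incremental contract is the same: each call returns only words that have been finalized so far.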