---
language:
- de
license: cc-by-4.0
library_name: nemo
datasets:
- mozilla-foundation/common_voice_7_0
- Multilingual LibriSpeech (2000 hours)
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- CTC
- Conformer
- Transformer
- NeMo
- pytorch
model-index:
- name: stt_de_conformer_transducer_large
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: common_voice_7_0
      name: mozilla-foundation/common_voice_7_0
      config: other
      split: test
      args:
        language: de
    metrics:
    - type: wer
      value: 4.93
      name: WER
---
|
|
|
|
|
## Model Overview |
|
|
|
A large Conformer-Transducer model for automatic speech recognition of German speech, built with the NVIDIA NeMo toolkit.
|
|
|
## NVIDIA NeMo: Training |
|
|
|
To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
|
```shell
pip install "nemo_toolkit[all]"
```
|
|
|
## How to Use this Model |
|
|
|
The model is available for use in the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo) and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
|
|
|
### Automatically instantiate the model |
|
|
|
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("iqbalc/stt_de_conformer_transducer_large")
```
|
|
|
### Transcribing using Python |
|
```python
asr_model.transcribe(['filename.wav'])
```
|
|
|
### Transcribing many audio files |
|
|
|
```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="iqbalc/stt_de_conformer_transducer_large" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
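The same batch workflow can also be sketched directly in Python. The helper names `list_wav_files` and `transcribe_directory` below are hypothetical and not part of NeMo. The sketch assumes that `asr_model.transcribe` accepts a list of file paths and a `batch_size` keyword, and that some NeMo versions return a `(best_hypotheses, all_hypotheses)` tuple for Transducer models; check both against your installed version.

```python
from pathlib import Path


def list_wav_files(audio_dir):
    """Return the paths of all .wav files directly inside audio_dir, sorted by name."""
    return sorted(str(p) for p in Path(audio_dir).glob("*.wav"))


def transcribe_directory(asr_model, audio_dir, batch_size=4):
    """Transcribe every .wav file in audio_dir and map each path to its result.

    Assumption: asr_model exposes transcribe(list_of_paths, batch_size=...).
    """
    files = list_wav_files(audio_dir)
    if not files:
        return {}
    results = asr_model.transcribe(files, batch_size=batch_size)
    # Assumption: some NeMo versions return a tuple for Transducer models;
    # in that case the first element holds the best hypotheses.
    if isinstance(results, tuple):
        results = results[0]
    return dict(zip(files, results))
```

With the model loaded as shown earlier, `transcribe_directory(asr_model, "<DIRECTORY CONTAINING AUDIO FILES>")` would return one entry per WAV file in that directory.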
|
|
|
### Input |
|
|
|
This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
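Before transcribing, it can help to confirm that a file matches this format (16 kHz sample rate, one channel). The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` is hypothetical, and the sketch assumes uncompressed PCM WAV files.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels
```

Files that fail the check would need to be resampled or downmixed to mono before being passed to the model.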
|
|
|
### Output |
|
|
|
This model provides transcribed speech as a string for a given audio sample. |
|
|
|
## Model Architecture |
|
|
|
The Conformer-Transducer model is an autoregressive variant of the Conformer model for automatic speech recognition that uses Transducer loss and decoding.
|
|
|
## Training |
|
|
|
The NeMo toolkit was used for training the models. These models are fine-tuned with this example script and this base config. |
|
|
|
The tokenizers for these models were built using the text transcripts of the train set with this script. |
|
|
|
### Datasets |
|
|
|
All the models in this collection are trained on a composite dataset comprising over two thousand hours of cleaned German speech:
|
|
|
1. MCV7.0 567 hours |
|
2. MLS 1524 hours |
|
3. VoxPopuli 214 hours |
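As a quick arithmetic check of the "over two thousand hours" figure, the per-corpus hours listed above can be summed. The variable names are illustrative only.

```python
# Hours per corpus, copied from the list above.
hours = {"MCV7.0": 567, "MLS": 1524, "VoxPopuli": 214}

total_hours = sum(hours.values())

# Percentage of the composite dataset contributed by each corpus.
shares = {name: round(100 * h / total_hours, 1) for name, h in hours.items()}

print(total_hours)  # 2305
```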
|
|
|
## Performance |
|
|
|
Performance of the ASR model is reported as word error rate (WER, in %) with greedy decoding.
|
|
|
MCV7.0 test = 4.93 |
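For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch for a single utterance. It is not the evaluation code used to produce the number above, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, `word_error_rate("das ist ein test", "das ist test")` counts one deleted word out of four reference words. Over a whole test set, WER is normally computed from the total number of errors divided by the total number of reference words, not by averaging per-utterance values.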
|
|
|
## Limitations |
|
|
|
The model might perform worse on accented speech.
|
|
|
|
|
## References |
|
[NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) |