---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 2.5
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 5.6
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 12.1
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      split: test
      args:
        language: en_us
    metrics:
    - type: wer
      value: 6.8
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Switchboard
      type: unk
      split: eval2000
      args:
        language: en
    metrics:
    - type: wer
      value: 6.8
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Wall Street Journal
      type: unk
      split: eval92
      args:
        language: en
    metrics:
    - type: wer
      value: 1.3
      name: Test WER
---

# DeCRED-base

This is a **174M-parameter encoder-decoder E-Branchformer model** trained with a decoder-centric regularization technique on 6,000 hours of open-source, normalised English data.
It achieves Word Error Rates (WERs) comparable to `openai/whisper-medium` across multiple datasets with roughly a quarter of the parameters.

Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

*Disclaimer: The model currently produces insertions on utterances that contain only silence, because it was not trained on such data. A fix will be added soon.*

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio files of arbitrary length.

```python
from transformers import pipeline

model_id = "BUT-FIT/DeCRED-base"
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    feature_extractor=model_id,
    trust_remote_code=True,
)

# In newer versions of transformers (>4.31.0), there is a bug in the pipeline
# inference type, so the type is set by hand. The resulting warning can be ignored.
pipe.type = "seq2seq"

# Run beam search decoding with the joint CTC-attention scorer
result_beam = pipe("audio.wav")

# Run greedy decoding without the joint CTC-attention scorer
pipe.model.generation_config.ctc_weight = 0.0
pipe.model.generation_config.num_beams = 1
result_greedy = pipe("audio.wav")
```

## Citation

If you use [DeCRED](https://arxiv.org/abs/2410.17437) in your research, please cite the following paper:

```bibtex
@misc{polok2024improvingautomaticspeechrecognition,
      title={Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models},
      author={Alexander Polok and Santosh Kesiraju and Karel Beneš and Lukáš Burget and Jan Černocký},
      year={2024},
      eprint={2410.17437},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.17437},
}
```
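
## Computing WER

The results in this card are reported as word error rate (WER). For readers who want to sanity-check a transcript against a reference, the following is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. It is not the evaluation script used for the reported numbers. The function name `word_error_rate` is hypothetical, and the sketch assumes both strings have already been normalised (casing, punctuation) in the same way.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only; assumes both inputs are already normalised.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # WER is undefined for an empty reference (e.g. a silence-only
        # utterance), because every hypothesis word is an insertion.
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.1%}")
```

Corpus-level WER is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, which differs from averaging per-utterance scores.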