arxiv:2412.01861

Late fusion ensembles for speech recognition on diverse input audio representations

Published on Dec 1, 2024

Authors:

Abstract

We explore diverse representations of speech audio and their effect on the performance of a late fusion ensemble of E-Branchformer models applied to the Automatic Speech Recognition (ASR) task. Although ensemble methods are generally known to improve system performance, including for speech recognition, it is interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, perform when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely used benchmark datasets (Librispeech, Aishell, Gigaspeech, and TEDLIUMv2) and show that improvements of 1% to 14% can still be achieved over state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such an ensemble offers improvements even when language models are used, although the gap narrows.
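To make the idea of late fusion concrete, here is a minimal sketch, not the authors' implementation. It assumes that each base model (one per input audio representation) assigns a raw score to the same shared list of candidate transcripts, and that fusion is a weighted average of the normalized log-probabilities. The abstract does not specify the fusion rule, so the averaging scheme, the function names `log_softmax` and `late_fusion`, and all scores below are hypothetical.

```python
import math


def log_softmax(scores):
    """Normalize raw scores into log-probabilities (numerically stable)."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def late_fusion(per_model_scores, weights=None):
    """Fuse per-model scores over a shared list of candidate transcripts.

    per_model_scores: one list of raw scores per base model, all aligned
        to the same candidates (an assumption of this sketch).
    weights: optional per-model weights; defaults to a uniform average.
    Returns the index of the winning candidate and the fused log-probabilities.
    """
    n_models = len(per_model_scores)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    # Normalize each model separately so that score scales are comparable.
    normalized = [log_softmax(scores) for scores in per_model_scores]
    n_candidates = len(normalized[0])
    fused = [
        sum(w * model[i] for w, model in zip(weights, normalized))
        for i in range(n_candidates)
    ]
    best = max(range(n_candidates), key=fused.__getitem__)
    return best, fused


# Hypothetical scores from three base models, each imagined to be trained
# on a different representation of the input audio.
candidates = ["the cat sat", "the cat sad", "a cat sat"]
scores = [
    [2.0, 1.5, 0.1],
    [1.0, 1.2, 0.3],
    [2.2, 0.9, 0.5],
]
best, fused = late_fusion(scores)
print(candidates[best])
```

In this toy example the second model prefers a different candidate from the other two, and the fused score still selects the transcript favored by the majority. Combining decisions after each model has produced its own output is what distinguishes late fusion from fusing features at the input.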
