Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead
Abstract
Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeouts or voice activity detectors (VADs), which are both limited to acoustic features. This segmentation is often overly aggressive, given that people naturally pause to think as they speak. Consequently, segmentation happens mid-sentence, hindering both punctuation and downstream tasks like machine translation for which high-quality segmentation is critical. Model-based segmentation methods that leverage acoustic features are powerful, but without an understanding of the language itself, these approaches are limited. We present a hybrid approach that leverages both acoustic and language information to improve segmentation. Furthermore, we show that including one word as a look-ahead boosts segmentation quality. On average, our models improve the segmentation F0.5 score by 9.8% over baseline. We show that this approach works for multiple languages. For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
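To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a hybrid segmentation decision: an acoustic pause feature is fused with a linguistic boundary score, and the decision after word t also uses word t+1 as a one-word look-ahead. The feature names, thresholds, fusion weights, and the toy "language model" are all illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of acousto-linguistic segmentation with one-word look-ahead.
from dataclasses import dataclass
from typing import List


@dataclass
class WordEvent:
    word: str
    pause_after_ms: float  # silence following the word (acoustic feature)


def lm_boundary_prob(prev_word: str, next_word: str) -> float:
    """Toy stand-in for a linguistic boundary model: probability that a
    sentence boundary falls between prev_word and next_word."""
    # Assumed heuristic: boundaries are unlikely right before function words.
    function_words = {"and", "but", "or", "that", "which", "to", "of"}
    if next_word.lower() in function_words:
        return 0.05
    if prev_word.endswith((".", "?", "!")):
        return 0.95
    return 0.4


def segment(words: List[WordEvent],
            pause_threshold_ms: float = 300.0,
            boundary_threshold: float = 0.6) -> List[List[str]]:
    """Split a word stream into segments by fusing an acoustic pause score
    with a linguistic boundary score, using a one-word look-ahead."""
    segments, current = [], []
    for i, ev in enumerate(words):
        current.append(ev.word)
        # One-word look-ahead: the decision after word i consults word i+1.
        next_word = words[i + 1].word if i + 1 < len(words) else "</s>"
        acoustic_score = min(ev.pause_after_ms / pause_threshold_ms, 1.0)
        linguistic_score = lm_boundary_prob(ev.word, next_word)
        fused = 0.5 * acoustic_score + 0.5 * linguistic_score  # assumed fusion
        if fused >= boundary_threshold:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    stream = [
        WordEvent("I", 40), WordEvent("think", 450), WordEvent("that", 60),
        WordEvent("we", 30), WordEvent("should", 50), WordEvent("go", 600),
        WordEvent("Tomorrow", 80), WordEvent("works", 500),
    ]
    for seg in segment(stream):
        print(" ".join(seg))
```

In this toy example, the long thinking pause after "think" does not trigger a split because the look-ahead word "that" makes a sentence boundary linguistically unlikely, whereas a purely acoustic VAD-style rule would have cut the sentence there.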