arxiv:2401.07333

ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering

Published on Jan 14, 2024

Abstract

The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges in fine-grained control over the synthesized speech with an autoregressive (AR) language model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms VALL-E in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies. The code of ELLA-V will be open-sourced after cleanups. Audio samples are available at https://ereboas.github.io/ELLAV/.
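
The central reordering idea, placing each phoneme token immediately before its aligned acoustic tokens, can be illustrated with a minimal Python sketch. The alignment format, the interleave helper, and the <EOP>/<EOS> markers and acoustic-token rendering below are illustrative assumptions, not the paper's exact token scheme:

    # Minimal sketch of alignment-guided sequence reordering (illustrative only).
    # Given a phoneme sequence and an alignment mapping each phoneme to its span
    # of acoustic codec tokens, build an interleaved sequence in which each
    # phoneme token appears ahead of its corresponding acoustic tokens.

    from typing import List, Tuple

    def interleave(phonemes: List[str],
                   acoustic: List[int],
                   alignment: List[Tuple[int, int]]) -> List[str]:
        """alignment[i] = (start, end) frame span of phonemes[i] in `acoustic`."""
        seq: List[str] = []
        for ph, (start, end) in zip(phonemes, alignment):
            seq.append(ph)                                         # phoneme token first
            seq.extend(f"<a{tok}>" for tok in acoustic[start:end]) # then its acoustic frames
            seq.append("<EOP>")                                    # hypothetical end-of-phoneme marker
        seq.append("<EOS>")                                        # hypothetical end-of-sequence marker
        return seq

    # Example: two phonemes aligned to five acoustic codec tokens.
    print(interleave(["HH", "AY"], [101, 102, 103, 104, 105], [(0, 2), (2, 5)]))
    # ['HH', '<a101>', '<a102>', '<EOP>', 'AY', '<a103>', '<a104>', '<a105>', '<EOP>', '<EOS>']

Because each acoustic span is explicitly bracketed by its phoneme token and an end marker, an AR decoder trained on such sequences has a local alignment cue at every step, which is what the abstract credits for reducing repetitions, omissions, and runaway silence.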
