arxiv:2410.02678

Distilling an End-to-End Voice Assistant Without Instruction Training Data

Published on Oct 3 · Submitted by WillHeld on Oct 4

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.

Community

Paper author · Paper submitter (edited Oct 5)

Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Despite needing only ASR data, DiVA achieves a 72% win rate over Qwen 2 Audio, a strong baseline trained with over 100x more compute.

[Figure: train.png]
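For readers curious what this recipe could look like in code, here is a minimal sketch of one distillation step under the setup the abstract describes: the text-only LLM responds to the ASR transcript, and its token distribution over that response supervises the speech model. Everything here is illustrative, not DiVA's actual implementation: `speech_llm`, `text_llm`, and their call signatures are hypothetical placeholders, and we assume both models share a vocabulary so the logits align.

```python
# Hypothetical sketch of self-supervised distillation from a text-only LLM.
# Assumes paired (audio, transcript) ASR data and a shared vocabulary
# between teacher and student; no instruction data or annotated responses.
import torch
import torch.nn.functional as F

def distillation_step(speech_llm, text_llm, audio, transcript_ids, optimizer):
    # 1. Teacher: the text-only LLM responds to the transcript. Its own
    #    response tokens become the self-supervised target sequence.
    with torch.no_grad():
        response_ids = text_llm.generate(transcript_ids, max_new_tokens=128)
        teacher_logits = text_llm(response_ids).logits

    # 2. Student: the speech LLM conditions on the raw audio instead of
    #    the transcript and scores the same response tokens.
    student_logits = speech_llm(audio, response_ids).logits

    # 3. KL divergence pulls the student's next-token distribution toward
    #    the teacher's, distilling text-LLM behavior into the speech model.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The appeal of this setup is that the only data requirement is ASR pairs, which are far cheaper and more abundant than instruction-tuning data; see the paper for the actual training objective and details.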

Besides our benchmarking and user study, we're excited to make the model available for folks to try themselves: https://huggingface.co/spaces/WillHeld/diva-audio.

This is all in addition to our original release in July: the model weights, training code, inference code, and raw evaluation results!


Models citing this paper: 5
Datasets citing this paper: 0
Spaces citing this paper: 2
Collections including this paper: 4