Distilling an End-to-End Voice Assistant Without Instruction Training Data
Abstract
Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using over 100x less training compute.
Community
Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Despite needing only ASR data, DiVA achieves a 72% win rate over Qwen 2 Audio, a strong baseline trained with over 100x more compute.
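To make that paradigm concrete, here is a minimal, hypothetical sketch of the idea (not the released training code): a frozen text-only LLM reads the ASR transcript and produces a next-token distribution, and the speech-conditioned model is trained to match that distribution with a token-level KL divergence, so paired audio and transcripts are the only data required. Random tensors stand in for the two models' logits below.

```python
# Illustrative sketch of response-distillation self-supervision.
# Assumes a frozen text-only "teacher" LLM and a speech-conditioned
# "student" that share a vocabulary; random tensors stand in for
# their next-token logits here.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32_000, 16

# Hypothetical outputs: teacher logits come from the text LLM reading the
# ASR transcript; student logits come from the same backbone reading
# projected audio embeddings instead of text.
teacher_logits = torch.randn(seq_len, vocab_size)                       # frozen, no grad
student_logits = torch.randn(seq_len, vocab_size, requires_grad=True)   # audio-side params

# Token-level KL divergence pulls the speech-conditioned distribution
# toward the text-conditioned one -- no human-written responses needed.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss.backward()  # gradients flow only into the audio-side parameters
print(f"distillation loss: {loss.item():.3f}")
```

Because the target is the text LLM's own response distribution, training pulls the speech model toward the backbone's existing behavior rather than toward a new SFT distribution, which is what helps avoid the forgetting described in the abstract.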
Besides our benchmarking and user study, we're excited to make the model available for folks to try themselves: https://huggingface.co/spaces/WillHeld/diva-audio.
This is all in addition to our original release in July, which included the model weights, training code, inference code, and raw evaluation results!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data (2024)
- Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models (2024)
- SSR: Alignment-Aware Modality Connector for Speech Language Models (2024)
- Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech (2024)
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (2024)