Distilling an End-to-End Voice Assistant Without Instruction Training Data
Abstract
Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using over 100x less training compute.
Community
Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Despite needing only ASR data, DiVA achieves a 72% win rate over Qwen 2 Audio, a strong baseline trained with over 100x more compute.
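To make that paradigm concrete, here is a minimal, hypothetical sketch of the idea (not the released training code): a frozen text-only LLM reads the ASR transcript and produces a next-token distribution, and the speech-conditioned model is trained to match that distribution with a token-level KL divergence, so paired audio and transcripts are the only data required. Random tensors stand in for the two models' logits below.

```python
# Illustrative sketch of response-distillation self-supervision.
# Assumes a frozen text-only "teacher" LLM and a speech-conditioned
# "student" that share a vocabulary; random tensors stand in for
# their next-token logits here.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32_000, 16

# Hypothetical outputs: teacher logits come from the text LLM reading the
# ASR transcript; student logits come from the same backbone reading
# projected audio embeddings instead of text.
teacher_logits = torch.randn(seq_len, vocab_size)                       # frozen, no grad
student_logits = torch.randn(seq_len, vocab_size, requires_grad=True)   # audio-side params

# Token-level KL divergence pulls the speech-conditioned distribution
# toward the text-conditioned one -- no human-written responses needed.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss.backward()  # gradients flow only into the audio-side parameters
print(f"distillation loss: {loss.item():.3f}")
```

Because the target is the text LLM's own response distribution, training pulls the speech model toward the backbone's existing behavior rather than toward a new SFT distribution, which is what helps avoid the forgetting described in the abstract.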
Besides our benchmarking and user study, we're excited to make the model available for folks to try themselves: https://huggingface.co/spaces/WillHeld/diva-audio.
This is all in addition to our original release in July, which included the model weights, training code, inference code, and raw evaluation results!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data (2024)
- Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models (2024)
- SSR: Alignment-Aware Modality Connector for Speech Language Models (2024)
- Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech (2024)
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (2024)