metadata
license: mpl-2.0
datasets:
- mozilla-foundation/common_voice_17_0
base_model:
- meta-llama/Llama-3.2-3B-Instruct
Model Card for Diva Llama 3.2 3B
This is an end-to-end Voice Assistant Model which can handle speech and text as inputs. It is trained using distillation loss. More details in the pre-print here.
See the model in action at diva-audio.github.io or look at the full training logs on Weights&Biases.
Citation
BibTeX:
@misc{DiVA,
title={{D}istilling an {E}nd-to-{E}nd {V}oice {A}ssistant {W}ithout {I}nstruction {T}raining {D}ata},
author={William Held and Ella Li and Michael Ryan and Weiyan Shi and Yanzhe Zhang and Diyi Yang},
year={2024},
eprint={2410.02678},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.02678},
}
Inference Example
from transformers import AutoModel
import librosa
import wget
filename = wget.download(
"https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-1008642825401516622.wav"
)
speech_data, _ = librosa.load(filename, sr=16_000)
model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3.2-3B", trust_remote_code=True)
print(model.generate([speech_data]))
print(model.generate([speech_data], ["Reply Briefly Like A Pirate"]))
filename = wget.download(
"https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-2426554427049983479.wav"
)
speech_data2, _ = librosa.load(filename, sr=16_000)
print(
model.generate(
[speech_data, speech_data2],
["Reply Briefly Like A Pirate", "Reply Briefly Like A New Yorker"],
)
)
Table of Contents
- Model Card for DiVA Llama 3
- Citation
- Table of Contents
- Training Details
- Environmental Impact
- Technical Specifications [optional]
- Model Card Contact
Training Details
Training Data
This model was trained on the CommonVoice corpus.
Training Procedure
This model was trained for 4.3k gradient steps with a batch size of 512 Recordings and a cosine learning rate schedule from 2e-4 to zero, with a linear warmup of 70 steps.
Environmental Impact
- Hardware Type: V4-64 TPU
- Hours used: 4 Hours
- Cloud Provider: Google Cloud.
- Compute Region: US Central C
Hardware
This model was trained on at V4-64 TPU on Google Cloud.
Software
This model was trained with Levanter
Model Card Authors [optional]
Will Held