Model Card for DiVA Llama 3
This is an ablation of our Distilled Voice Assistant (DiVA) model, which can handle both speech and text as inputs. This ablation is trained using only the token-alignment loss, as described in the ablations here: https://huggingface.co/papers/2410.02678
Weights and Biases Run: https://wandb.ai/i18nlp/DiVA%20Training%20Runs/runs/4t0mvbcd?nw=nwuserheld
Citation
This is the token-alignment-only model from https://huggingface.co/papers/2410.02678. BibTeX:
@misc{DiVA,
title={{D}istilling an {E}nd-to-{E}nd {V}oice {A}ssistant {W}ithout {I}nstruction {T}raining {D}ata},
author={William Held and Ella Li and Michael Ryan and Weiyan Shi and Yanzhe Zhang and Diyi Yang},
year={2024},
eprint={2410.02678},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.02678},
}
Table of Contents
- Model Card for DiVA Llama 3
- Citation
- Table of Contents
- Training Details
- Environmental Impact
- Technical Specifications [optional]
- Model Card Authors [optional]
Training Details
Training Data
This model was trained on the CommonVoice corpus.
Training Procedure
This model was trained for 7k gradient steps with a batch size of 512 recordings. The learning rate used a linear warmup over the first 70 steps, followed by a linear decay from 5e-5 to zero.
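The schedule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Levanter configuration used for the run: it assumes the warmup rises from zero to the 5e-5 peak, that the decay reaches zero exactly at the final step, and that "7k" means 7,000 steps. The names `PEAK_LR`, `TOTAL_STEPS`, `WARMUP_STEPS`, `BATCH_SIZE`, and `learning_rate` are hypothetical and chosen for this example.

```python
# Hypothetical constants taken from the numbers in this card.
PEAK_LR = 5e-5        # peak learning rate
TOTAL_STEPS = 7_000   # "7k gradient steps" (assumed to mean exactly 7,000)
WARMUP_STEPS = 70     # linear warmup length
BATCH_SIZE = 512      # recordings per gradient step


def learning_rate(step: int) -> float:
    """Return the learning rate at a given gradient step.

    Assumption: warmup goes linearly from 0 to PEAK_LR over WARMUP_STEPS,
    then the rate decays linearly to 0 at TOTAL_STEPS.
    """
    if step < 0 or step > TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}], got {step}")
    if step < WARMUP_STEPS:
        # Linear warmup phase.
        return PEAK_LR * step / WARMUP_STEPS
    # Linear decay phase: equals PEAK_LR at the end of warmup, 0 at the last step.
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)


# Total recordings processed over the run (with repetition, if any).
total_recordings_seen = TOTAL_STEPS * BATCH_SIZE  # 3,584,000
```

Under these assumptions, the run processes 7,000 × 512 = 3,584,000 recordings in total, counting any repeats.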
Environmental Impact
- Hardware Type: V4-32 TPU
- Hours used: 8
- Cloud Provider: Google Cloud
- Compute Region: US Central C
Hardware
This model was trained on a V4 TPU on Google Cloud.
Software
This model was trained with Levanter.
Model Card Authors [optional]
Will Held