---
license: cc-by-2.0
datasets:
- openslr/librispeech_asr
language:
- en
metrics:
- wer
base_model:
- facebook/wav2vec2-base-960h
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
<img src="./EE.gif" align="center" width="70%">
## Model Details
### Model Description
A Wav2Vec 2.0 model trained with an early-exit pipeline: intermediate exits allow transcriptions to be produced before the full encoder has run.
- **Developed by:** SpeechTek unit, Fondazione Bruno Kessler
- **Model type:** Wav2Vec 2.0
- **Language(s) (NLP):** English
- **Finetuned from model:** facebook/wav2vec2-base-960h
- **Repository:** https://github.com/augustgw/wav2vec2-ee
- **Paper:** Training early-exit architectures for automatic speech recognition: Fine-tuning pre-trained models or training from scratch
### Downstream Use
The model is intended for computationally efficient ASR: each intermediate exit can produce a transcription, so inference can stop early and trade accuracy for compute.
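A hedged usage sketch with the standard `transformers` CTC API (decoding from intermediate exits requires the custom model classes in the repository above; the checkpoint path and the input waveform below are placeholders):

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("path/to/checkpoint")  # placeholder path

speech = np.zeros(16_000, dtype=np.float32)  # placeholder: 1 s of 16 kHz audio
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # final-layer CTC logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```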
## Training Details
### Training Data
The model is trained using the LibriSpeech-960h dataset.
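A sketch of assembling the 960 h training set from the three LibriSpeech training splits with the Hugging Face `datasets` library (the actual training scripts may load the data differently):

```python
from datasets import load_dataset, concatenate_datasets

# LibriSpeech 960h = train-clean-100 + train-clean-360 + train-other-500
clean_100 = load_dataset("librispeech_asr", "clean", split="train.100")
clean_360 = load_dataset("librispeech_asr", "clean", split="train.360")
other_500 = load_dataset("librispeech_asr", "other", split="train.500")
train_960 = concatenate_datasets([clean_100, clean_360, other_500])
```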
### Training Procedure
#### Basic training
- Fine-tuning with only EE loss: `finetune_ee.py`
- Fine-tuning a model without early exits: `finetune_non-ee.py`
- Change `model_config = Wav2Vec2Config(num_hidden_layers=X)` to set the number of encoder layers; e.g., for a 4-layer encoder: `model_config = Wav2Vec2Config(num_hidden_layers=4)` (see the sketch below)
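A minimal sketch of the encoder-depth configuration using the standard `transformers` API (the actual scripts build the early-exit model on top of this config):

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# A 4-layer encoder; all other hyperparameters keep the
# wav2vec2-base defaults.
model_config = Wav2Vec2Config(num_hidden_layers=4)
model = Wav2Vec2ForCTC(model_config)  # randomly initialized
print(model.config.num_hidden_layers)  # -> 4
```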
#### Training Hyperparameters
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-ee/checkpoints/",
    evaluation_strategy="no",
    # eval_steps=1000,
    save_strategy="epoch",
    # eval_accumulation_steps=10,
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    num_train_epochs=100,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="wandb",
    logging_strategy="steps",
    logging_steps=1000,
    dataloader_num_workers=1,
    ignore_data_skip=True,
)
```
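These arguments feed a standard `Trainer` loop; a sketch assuming `model`, `train_dataset`, and a CTC `data_collator` are built as in `finetune_ee.py`:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,                  # early-exit model (see finetune_ee.py)
    args=training_args,
    train_dataset=train_dataset,  # preprocessed LibriSpeech features
    data_collator=data_collator,  # pads audio inputs and CTC labels
)
trainer.train()
```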
## Evaluation
The evaluation scripts create files in the indicated output directory. `wer_results.txt` contains the layerwise WERs on the test sets indicated in the evaluation script. The remaining files contain the layerwise transcriptions of each item in each test set.
### Basic evaluation
- Normal evaluation: `eval.py path/to/model/checkpoint path/to/output/directory`
  - For safetensors checkpoints saved by newer versions of Hugging Face Transformers, see the note in `eval.py`
- Evaluation for models without early exits (evaluates only output of final layer): `eval_non-ee.py path/to/model/checkpoint path/to/output/directory`
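The WERs reported below are standard word error rates; a minimal sketch of the metric with the `jiwer` package (the evaluation scripts may use a different implementation):

```python
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 1 deletion / 6 words ≈ 0.167
```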
### Results
| Exit   | Test-Clean WER (%) | Dev-Clean WER (%) |
|--------|--------------------|-------------------|
| Exit(1)| 19.14 | 19.06 |
| Exit(2)| 8.26 | 8.01 |
| Exit(3)| 5.93 | 5.57 |
| Exit(4)| 4.74 | 4.48 |
| Exit(5)| 3.98 | 3.79 |
| Exit(6)| 3.95 | 3.69 |
## Citation
```bibtex
@inproceedings{wright2024training,
title={Training early-exit architectures for automatic speech recognition: Fine-tuning pre-trained models or training from scratch},
author={Wright, George August and Cappellazzo, Umberto and Zaiem, Salah and Raj, Desh and Yang, Lucas Ondel and Falavigna, Daniele and Ali, Mohamed Nabih and Brutti, Alessio},
booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
pages={685--689},
year={2024},
organization={IEEE}
}
``` |