---
license: cc-by-2.0
datasets:
- openslr/librispeech_asr
language:
- en
metrics:
- wer
base_model:
- facebook/wav2vec2-base-960h
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

<img src="./EE.gif" align="center" width="70%">

## Model Details

### Model Description

A Wav2Vec 2.0 model trained with an early-exit pipeline.

- **Developed by:** SpeechTek unit, Fondazione Bruno Kessler
- **Model type:** Wav2Vec 2.0
- **Language(s) (NLP):** English
- **Finetuned from model:** facebook/wav2vec2-base-960h
- **Repository:** https://github.com/augustgw/wav2vec2-ee
- **Paper:** Training early-exit architectures for automatic speech recognition: Fine-tuning pre-trained models or training from scratch

### Downstream Use

The model is trained for computationally efficient ASR: the early exits allow transcriptions to be produced from intermediate encoder layers, trading accuracy for inference cost.
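For reference, here is a minimal transcription sketch. It assumes the released checkpoint can be loaded with the standard `Wav2Vec2ForCTC` class and decodes only from the final exit; exit-specific decoding requires the early-exit code in the linked repository, and the checkpoint path below is a placeholder.

```python
# Sketch (assumptions): final-exit decoding only; the checkpoint path is a
# placeholder, and per-exit decoding needs https://github.com/augustgw/wav2vec2-ee
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("path/to/early-exit/checkpoint")  # placeholder

# Load a short 16 kHz sample for demonstration
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
speech = ds[0]["audio"]["array"]

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```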

## Training Details

### Training Data

The model is trained using the LibriSpeech-960h dataset.
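For illustration, a sketch of how the 960-hour training set can be assembled with 🤗 Datasets; the actual preprocessing is handled by the training scripts in the linked repository.

```python
# Sketch (assumption): assembling LibriSpeech-960h from its three training splits;
# feature extraction and tokenization are done in the repository's training scripts.
from datasets import concatenate_datasets, load_dataset

clean_100 = load_dataset("openslr/librispeech_asr", "clean", split="train.100")
clean_360 = load_dataset("openslr/librispeech_asr", "clean", split="train.360")
other_500 = load_dataset("openslr/librispeech_asr", "other", split="train.500")

train_960h = concatenate_datasets([clean_100, clean_360, other_500])
print(train_960h)
```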

### Training Procedure

#### Basic training

- Fine-tuning with only the EE loss: `finetune_ee.py`
- Fine-tuning a model without early exits: `finetune_non-ee.py`
- Set the number of encoder layers by changing `model_config = Wav2Vec2Config(num_hidden_layers=X)`; e.g., for a 4-layer encoder: `model_config = Wav2Vec2Config(num_hidden_layers=4)` (see the sketch below)
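As a minimal sketch, the config change above applied to a plain CTC model; the actual early-exit model class is defined in the training scripts of the linked repository.

```python
# Sketch (assumption): the early-exit model class lives in the repository; this
# only illustrates how the encoder depth is controlled through the config.
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

model_config = Wav2Vec2Config(num_hidden_layers=4)  # 4-layer encoder instead of the default 12
model = Wav2Vec2ForCTC(model_config)
print(model.config.num_hidden_layers)  # 4
```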

#### Training Hyperparameters

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-ee/checkpoints/",
    evaluation_strategy="no",
    # eval_steps=1000,
    save_strategy="epoch",
    # eval_accumulation_steps=10,
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    num_train_epochs=100,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="wandb",
    logging_strategy="steps",
    logging_steps=1000,
    dataloader_num_workers=1,
    ignore_data_skip=True,
)
```
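A sketch of how these arguments could be wired into a standard `Trainer`; the model, dataset, collator, and processor names below are placeholders, and the actual training loop is implemented in `finetune_ee.py`.

```python
# Sketch (assumptions): `model`, `train_dataset`, `data_collator`, and `processor`
# are placeholders; the real setup is in finetune_ee.py in the repository.
from transformers import Trainer

trainer = Trainer(
    model=model,                  # early-exit Wav2Vec 2.0 model
    args=training_args,           # the TrainingArguments shown above
    train_dataset=train_dataset,  # preprocessed LibriSpeech training set
    data_collator=data_collator,  # CTC padding collator
    tokenizer=processor,          # saved alongside checkpoints
)
trainer.train()
```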

## Evaluation

The evaluation scripts write their results to the specified output directory. `wer_results.txt` contains the layerwise WERs on the test sets listed in the evaluation script; the remaining files contain the layerwise transcriptions of each item in each test set.

### Basic evaluation

- Normal evaluation: `eval.py path/to/model/checkpoint path/to/output/directory`
  - For safetensors checkpoints saved by newer versions of Hugging Face, see the note in `eval.py`
- Evaluation for models without early exits (evaluates only the output of the final layer): `eval_non-ee.py path/to/model/checkpoint path/to/output/directory`
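For reference, a sketch of how a WER such as those reported in `wer_results.txt` could be computed with the 🤗 Evaluate library; the actual per-exit decoding and scoring logic lives in `eval.py`.

```python
# Sketch (assumption): eval.py implements the real per-exit decoding; this only
# shows the WER computation on already-decoded transcriptions.
import evaluate

wer_metric = evaluate.load("wer")

references = ["a sample reference transcription"]
predictions = ["a sample predicted transcription"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```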

### Results

| Exit   | Test-Clean WER (%) | Dev-Clean WER (%) |
|--------|--------------------|-------------------|
| Exit 1 | 19.14              | 19.06             |
| Exit 2 | 8.26               | 8.01              |
| Exit 3 | 5.93               | 5.57              |
| Exit 4 | 4.74               | 4.48              |
| Exit 5 | 3.98               | 3.79              |
| Exit 6 | 3.95               | 3.69              |

## Citation

```
@inproceedings{wright2024training,
  title={Training early-exit architectures for automatic speech recognition: Fine-tuning pre-trained models or training from scratch},
  author={Wright, George August and Cappellazzo, Umberto and Zaiem, Salah and Raj, Desh and Yang, Lucas Ondel and Falavigna, Daniele and Ali, Mohamed Nabih and Brutti, Alessio},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  pages={685--689},
  year={2024},
  organization={IEEE}
}

```