---
license: cc-by-2.0
datasets:
- openslr/librispeech_asr
language:
- en
metrics:
- wer
base_model:
- facebook/wav2vec2-base-960h
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

<img src="./EE.gif" align="center" width="70%">

## Model Details

### Model Description

A Wav2Vec 2.0 model trained with an early-exit pipeline.

- **Developed by:** SpeechTek unit, Fondazione Bruno Kessler
- **Model type:** Wav2Vec 2.0
- **Language(s) (NLP):** English
- **Finetuned from model:** facebook/wav2vec2-base-960h
- **Repository:** https://github.com/augustgw/wav2vec2-ee
- **Paper:** Training early-exit architectures for automatic speech recognition: Fine-tuning pre-trained models or training from scratch

### Downstream Use

The model is trained for computationally efficient ASR: the early exits allow transcriptions to be produced from intermediate encoder layers, trading accuracy for inference cost.
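For reference, here is a minimal transcription sketch. It assumes the released checkpoint can be loaded with the standard `Wav2Vec2ForCTC` class and decodes only from the final exit; exit-specific decoding requires the early-exit code in the linked repository, and the checkpoint path below is a placeholder.

```python
# Sketch (assumptions): final-exit decoding only; the checkpoint path is a
# placeholder, and per-exit decoding needs https://github.com/augustgw/wav2vec2-ee
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("path/to/early-exit/checkpoint")  # placeholder

# Load a short 16 kHz sample for demonstration
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
speech = ds[0]["audio"]["array"]

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```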

## Training Details

### Training Data

The model is trained using the LibriSpeech-960h dataset.
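For illustration, a sketch of how the 960-hour training set can be assembled with 🤗 Datasets; the actual preprocessing is handled by the training scripts in the linked repository.

```python
# Sketch (assumption): assembling LibriSpeech-960h from its three training splits;
# feature extraction and tokenization are done in the repository's training scripts.
from datasets import concatenate_datasets, load_dataset

clean_100 = load_dataset("openslr/librispeech_asr", "clean", split="train.100")
clean_360 = load_dataset("openslr/librispeech_asr", "clean", split="train.360")
other_500 = load_dataset("openslr/librispeech_asr", "other", split="train.500")

train_960h = concatenate_datasets([clean_100, clean_360, other_500])
print(train_960h)
```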

### Training Procedure

#### Basic training

- Fine-tuning with only the EE loss: `finetune_ee.py`
- Fine-tuning a model without early exits: `finetune_non-ee.py`
- Set the number of encoder layers by changing `model_config = Wav2Vec2Config(num_hidden_layers=X)`; e.g., for a 4-layer encoder: `model_config = Wav2Vec2Config(num_hidden_layers=4)` (see the sketch below)
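As a minimal sketch, the config change above applied to a plain CTC model; the actual early-exit model class is defined in the training scripts of the linked repository.

```python
# Sketch (assumption): the early-exit model class lives in the repository; this
# only illustrates how the encoder depth is controlled through the config.
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

model_config = Wav2Vec2Config(num_hidden_layers=4)  # 4-layer encoder instead of the default 12
model = Wav2Vec2ForCTC(model_config)
print(model.config.num_hidden_layers)  # 4
```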

#### Training Hyperparameters

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-ee/checkpoints/",
    evaluation_strategy="no",
    # eval_steps=1000,
    save_strategy="epoch",
    # eval_accumulation_steps=10,
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    num_train_epochs=100,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="wandb",
    logging_strategy="steps",
    logging_steps=1000,
    dataloader_num_workers=1,
    ignore_data_skip=True,
)
```
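A sketch of how these arguments could be wired into a standard `Trainer`; the model, dataset, collator, and processor names below are placeholders, and the actual training loop is implemented in `finetune_ee.py`.

```python
# Sketch (assumptions): `model`, `train_dataset`, `data_collator`, and `processor`
# are placeholders; the real setup is in finetune_ee.py in the repository.
from transformers import Trainer

trainer = Trainer(
    model=model,                  # early-exit Wav2Vec 2.0 model
    args=training_args,           # the TrainingArguments shown above
    train_dataset=train_dataset,  # preprocessed LibriSpeech training set
    data_collator=data_collator,  # CTC padding collator
    tokenizer=processor,          # saved alongside checkpoints
)
trainer.train()
```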

## Evaluation

The evaluation scripts write their results to the specified output directory. `wer_results.txt` contains the layerwise WERs on the test sets listed in the evaluation script; the remaining files contain the layerwise transcriptions of each item in each test set.

### Basic evaluation

- Normal evaluation: `eval.py path/to/model/checkpoint path/to/output/directory`
  - For safetensors checkpoints saved by newer versions of Hugging Face, see the note in `eval.py`
- Evaluation for models without early exits (evaluates only the output of the final layer): `eval_non-ee.py path/to/model/checkpoint path/to/output/directory`
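For reference, a sketch of how a WER such as those reported in `wer_results.txt` could be computed with the 🤗 Evaluate library; the actual per-exit decoding and scoring logic lives in `eval.py`.

```python
# Sketch (assumption): eval.py implements the real per-exit decoding; this only
# shows the WER computation on already-decoded transcriptions.
import evaluate

wer_metric = evaluate.load("wer")

references = ["a sample reference transcription"]
predictions = ["a sample predicted transcription"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```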

### Results

| Exit   | Test-Clean WER (%) | Dev-Clean WER (%) |
|--------|--------------------|-------------------|
| Exit 1 | 19.14              | 19.06             |
| Exit 2 | 8.26               | 8.01              |
| Exit 3 | 5.93               | 5.57              |
| Exit 4 | 4.74               | 4.48              |
| Exit 5 | 3.98               | 3.79              |
| Exit 6 | 3.95               | 3.69              |

## Citation

```
@inproceedings{wright2024training,
  title={Training early-exit architectures for automatic speech recognition: Fine-tuning pre-trained models or training from scratch},
  author={Wright, George August and Cappellazzo, Umberto and Zaiem, Salah and Raj, Desh and Yang, Lucas Ondel and Falavigna, Daniele and Ali, Mohamed Nabih and Brutti, Alessio},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  pages={685--689},
  year={2024},
  organization={IEEE}
}

```