File size: 8,276 Bytes
c9bccb0
 
 
5358d40
c9bccb0
 
5358d40
 
 
c9bccb0
5358d40
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c9bccb0
 
 
 
 
 
5358d40
 
 
 
 
 
c9bccb0
 
 
5358d40
c9bccb0
 
 
5358d40
 
c9bccb0
 
 
5358d40
c9bccb0
 
 
5358d40
 
c9bccb0
 
 
 
 
 
 
5358d40
 
c9bccb0
 
 
5660e05
c9bccb0
5358d40
 
c9bccb0
 
 
5358d40
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c9bccb0
 
 
 
 
5358d40
c9bccb0
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- google/fleurs
metrics:
- wer
model-index:
- name: Whisper Small Amharic FLEURS
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: google/fleurs am_et
      type: google/fleurs
      config: am_et
      split: test+validation
      args: am_et
    metrics:
    - name: Wer
      type: wer
      value: 160.99
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->


# Whisper Small Tamil FLEURS

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the google/fleurs am_et dataset.
It achieves the following results on the evaluation set:
- Loss: -
- Wer: 160.99

## Model description

- The main Whisper Small Hugging Face page: [Hugging Face - Whisper Small](https://huggingface.co/openai/whisper-small)

## Intended uses & limitations

- For experimentation and curiosity. 
- Based on the paper [AXRIV](https://arxiv.org/abs/2212.04356) and [Benchmarking OpenAI Whisper for non-English ASR - Dan Shafer](https://blog.deepgram.com/benchmarking-openai-whisper-for-non-english-asr/), there is a performance bias towards certain languages and curated datasets.

## Training and evaluation data

- This model was trained/evaluated on "test+validation" data from google/fleurs [google/fluers - HuggingFace Datasets](https://huggingface.co/datasets/google/fleurs).

## Training procedure

- The training was done in Lambda Cloud GPU on A100/40GB GPUs, which were provided by OpenAI Community Events [Whisper Fine Tuning Event - Dec 2022](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#fine-tune-whisper). The training was done using [HuggingFace Community Events - Whisper - run_speech_recognition_seq2seq_streaming.py](https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/run_speech_recognition_seq2seq_streaming.py) using the included [whisper_python_am_et.ipynb](https://huggingface.co/drmeeseeks/whisper-small-am_et/blob/main/am_et_fine_tune_whisper_streaming_colab_RUNNING-evalerrir.ipynb) to setup the Lambda Cloud GPU/Colab environment. For Colab, you must reduce the train batch size to the recommended amount mentioned at , as the T4 GPUs have 16GB of memory [Whisper Fine Tuning Event - Dec 2022](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#fine-tune-whisper). The notebook sets up the environment, logs into your huggingface account, and generates a bash script. The bash script generated in the IPYNB, `run.sh` was run from the terminal to train `bash run.sh`, as described on the Whisper community events GITHUB page. Num Examples = 446, Num Epochs = 715, Number of trainable parameters = 241734912.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 64
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 1
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 1000
- mixed_precision_training: Native AMP
- do_eval=False


### Training results

5000 Steps

| Training Loss | Epoch | Step | Learning Rate |
|:-------------:|:-----:|:-----:|:-----:|
| 3.0968        | 0.05 | - | 4.2e-7 |
| 1.178        | 28.57 | - | 3.92e-6 |
| 0.03        | 53.57 | - | 7.42e-6 |
| 0.0002        | 217.86 | - | 7.73e-6 |
| 0.0001        | 378.57 | ~ 2000 | 5.23e-6 |
| 0.0000        | 382.14 | - | 5.14e-6 |
| 0.0000        | 467.86 | 3300 | 3.84e-6|
| 0.0000        | 614.29 | 4300 | 1.56e-6|
| 0.0000        | 685.71 | 4812 | 4.53e-7|
| 0.0000        | 710.71 | 4997 | 6.44e-8|

3000 Steps

| Training Loss | Epoch | Step | Learning Rate |
|:-------------:|:-----:|:-----:|:-----:|
| 3.0968        | 0.05 | - | 4.2e-7 |
| 0.0017        | 96.43 | 687 | 9.316e-6 |


### Recommendations 

Limit training duration for smaller datasets to ~ 2000 to 3000 steps to avoid overfitting. 5000 steps using the [HuggingFace - Whisper Small](https://huggingface.co/openai/whisper-small) takes ~ 5hrs on A100 GPUs (1hr/1000 steps). Encountered `RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1` which is related to [Trainer RuntimeError](https://discuss.huggingface.co/t/trainer-runtimeerror-the-size-of-tensor-a-462-must-match-the-size-of-tensor-b-448-at-non-singleton-dimension-1/26010) as some languages datasets have input lengths that have non-standard lengths. The link did not resolve my issue, and appears elsewhere too [Training languagemodel – RuntimeError the expanded size of the tensor (100) must match the existing size (64) at non singleton dimension 1](https://hungsblog.de/en/technology/troubleshooting/training-languagemodel-runtimeerror-the-expanded-size-of-the-tensor-100-must-match-the-existing-size-64-at-non-singleton-dimension-1/). To circumvent this issue, `run.sh` only trains and saves the model (if you make changes to `run.sh` be sure to clear/rm the contents as piping appends). Then run `python run_eval_whisper_streaming.py --model_id="openai/whisper-small" --dataset="google/fleurs" --config="am_et" --batch_size=32 --max_eval_samples=64 --device=0 --language="am"` to find the WER score. Erroring out during evaluation prevents the trained model from loading to HugginFace. Based on the paper [AXRIV](https://arxiv.org/abs/2212.04356) and [Benchmarking OpenAI Whisper for non-English ASR - Dan Shafer](https://blog.deepgram.com/benchmarking-openai-whisper-for-non-english-asr/), there is a performance bias towards certain languages and curated datasets. The OpenAI fintuning community event provided ample _free_ GPU time to help develop the model further and improve WER scores.

### Environmental Impact

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). In total roughly 100 hours were used primarily in US East/Asia Pacific (80%/20%), with AWS as the reference. Additional resources are available at [Our World in Data - CO2 Emissions](https://ourworldindata.org/co2-emissions) 

- __Hardware Type__: AMD EPYC 7J13 64-Core Processor (30 core VM) 197GB RAM, with NVIDIA A100-SXM 40GB
- __Hours Used__: 100 hrs
- __Cloud Provider__: Lambda Cloud GPU 
- __Compute Region__: US East/Asia Pacific
- __Carbon Emitted__: 12 kg (GPU) + 13 kg (CPU) = 25 kg (the weight of 3 gallons of water)


### Citation

- [Whisper - GITHUB](https://github.com/openai/whisper)
- [Whisper - OpenAI - BLOG](https://openai.com/blog/whisper/)
- [Model Card - HuggingFace Hub - GITHUB](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md)

```bibtex
@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

@article{owidco2andothergreenhousegasemissions,
    author = {Hannah Ritchie and Max Roser and Pablo Rosado},
    title = {CO₂ and Greenhouse Gas Emissions},
    journal = {Our World in Data},
    year = {2020},
    note = {https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions}
}

```


### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2