hajekad committed · Commit 9c19a17 · 1 Parent(s): b6ff432

Update README.md

  1. README.md +29 -18
README.md CHANGED
@@ -4,30 +4,33 @@ datasets:
4
  - AudioCaps
5
  - Clotho-v2.1
6
  metrics:
7
- - TODO
 
 
 
 
 
8
  model-index:
9
- - name: whisper-TODO-audio-captioning
10
  results:
11
  - task:
12
  type: audio-captioning
13
  name: Audio Captioning
14
  dataset:
15
- type: TODO
16
- name: TODO
17
- split: TODO
18
  metrics:
19
- - type: Spider
20
- value: TODO
21
  - type: SPICE
22
- value: TODO
23
  - type: CIDEr
24
- value: TODO
25
  - type: SPIDEr
26
- value: TODO
27
  - type: METEOR
28
- value: TODO
29
  - type: SacreBLEU
30
- value: TODO
31
  license: cc-by-nc-4.0
32
  language:
33
  - en
@@ -41,7 +44,7 @@ A transformer encoder-decoder model for automatic audio captioning. As opposed t
41
  - **Model type:** Whisper encoder-decoder transformer
42
  - **Language(s) (NLP):** en
43
  - **License:** cc-by-4.0
44
- - **Parent Model:** openai/whisper-TODO
45
  - **Resources for more information:**
46
  - [GitHub Repo](https://github.com/prompteus/audio-captioning)
47
  - [Technical Report](TODO)
@@ -55,14 +58,14 @@ Minimal example:
55
 
56
  ```python3
57
  # Load model
58
- architecture = "openai/whisper-TODO"
59
- checkpoint = "TODO"
60
  model = audiocap.WhisperForAudioCaptioning.from_pretrained(checkpoint)
61
  tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
62
  feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(architecture)
63
 
64
  # Load and preprocess audio
65
- input_file = "TODO"
66
  audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
67
  features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
68
 
@@ -93,9 +96,9 @@ Our model class `WhisperForAudioCaptioning` can be found in our git repository o
93
 
94
  ## Training details
95
 
96
- The model was initialized from the original speech-to-text `openai/whisper-TODO` weights. Then, it was pretrained on a mix of (1) a subset of AudioSet with synthetic labels, (2) the AudioCaps captioning dataset, and (3) the Clotho v2.1 captioning dataset. Finally, it was finetuned on Clotho v2.1 to focus the model on the specific style of its captions. For each training input, the model was informed about the source of the data, so it can mimic the caption style of each of the 3 datasets.
97
 
98
- During pretraining, the ratio of samples in each batch was approximately 12:3:1 (AudioSet:AudioCaps:Clotho). The pretraining took TODO steps with batch size 32 and learning rate 2e-5. Finetuning was done on Clotho only, and the model was trained for TODO steps with batch size 32 and learning rate 4e-6. All layers except *fc1* layers were frozen during finetuning.
99
 
100
  For more information about the training regime, see the [technical report](TODO).
101
 
@@ -104,6 +107,14 @@ For more information about the training regime, see the [technical report](TODO)
104
 
105
  Metrics reported in the metadata were computed on Clotho v2.1 test split with captions generated using a beam search with 5 beams.
106
 
 
 
 
 
 
 
 
 
107
 
108
  ## Limitations
109
 
 
4
  - AudioCaps
5
  - Clotho-v2.1
6
  metrics:
7
+ - SPICE
8
+ - CIDEr
9
+ - SPIDEr
10
+ - METEOR
11
+ - SacreBLEU
12
+
13
  model-index:
14
+ - name: whisper-tiny-audio-captioning
15
  results:
16
  - task:
17
  type: audio-captioning
18
  name: Audio Captioning
19
  dataset:
20
+ type: clotho-v2.1
21
+ name: Clotho
22
+ split: evaluation
23
  metrics:
 
 
24
  - type: SPICE
25
+ value: 0.1077
26
  - type: CIDEr
27
+ value: 0.3404
28
  - type: SPIDEr
29
+ value: 0.2240
30
  - type: METEOR
31
+ value: 0.3452
32
  - type: SacreBLEU
33
+ value: 13.77
34
  license: cc-by-nc-4.0
35
  language:
36
  - en
 
44
  - **Model type:** Whisper encoder-decoder transformer
45
  - **Language(s) (NLP):** en
46
  - **License:** cc-by-4.0
47
+ - **Parent Model:** openai/whisper-tiny
48
  - **Resources for more information:**
49
  - [GitHub Repo](https://github.com/prompteus/audio-captioning)
50
  - [Technical Report](TODO)
 
58
 
59
  ```python3
60
  # Load model
61
+ architecture = "openai/whisper-tiny"
62
+ checkpoint = "MU-NLPC/whisper-tiny-audio-captioning"
63
  model = audiocap.WhisperForAudioCaptioning.from_pretrained(checkpoint)
64
  tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
65
  feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(architecture)
66
 
67
  # Load and preprocess audio
68
+ input_file = "..."
69
  audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
70
  features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
71
 
 
96
 
97
  ## Training details
98
 
99
+ The model was initialized from the original speech-to-text `openai/whisper-tiny` weights. Then, it was pretrained on a mix of (1) a subset of AudioSet with synthetic labels, (2) the AudioCaps captioning dataset, and (3) the Clotho v2.1 captioning dataset. Finally, it was finetuned on Clotho v2.1 to focus the model on the specific style of its captions. For each training input, the model was informed about the source of the data, so it can mimic the caption style of each of the 3 datasets.
100
 
101
+ During pretraining, the ratio of samples in each batch was approximately 12:3:1 (AudioSet:AudioCaps:Clotho). The pretraining took 36000 steps with batch size 32 and learning rate 2e-5. Finetuning was done on Clotho only, and the model was trained for 3900 steps with batch size 32 and learning rate 4e-6. All layers except *fc1* layers were frozen during finetuning.
102
 
103
  For more information about the training regime, see the [technical report](TODO).
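
To make the pretraining mix concrete, here is a minimal sketch of what an approximate 12:3:1 ratio means at batch size 32. The README does not say how batches were actually assembled, so the weighted random draw below is an assumption, and the names `RATIO`, `expected_counts`, and `sample_batch_sources` are hypothetical helpers, not part of the `audiocap` package.

```python
import random

# Approximate per-batch ratio stated in the README (AudioSet:AudioCaps:Clotho).
RATIO = {"audioset": 12, "audiocaps": 3, "clotho": 1}
BATCH_SIZE = 32


def expected_counts(ratio, batch_size):
    """Expected number of samples per source in a single batch."""
    total = sum(ratio.values())
    return {name: batch_size * weight / total for name, weight in ratio.items()}


def sample_batch_sources(ratio, batch_size, rng):
    """Draw one source label per batch slot, weighted by the ratio.

    Assumption: independent weighted draws; the real sampler may differ.
    """
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


if __name__ == "__main__":
    print(expected_counts(RATIO, BATCH_SIZE))
    rng = random.Random(0)  # fixed seed so the draw is reproducible
    print(sample_batch_sources(RATIO, BATCH_SIZE, rng))
```

With random draws the per-batch counts only match the ratio on average, which is consistent with the README calling the ratio approximate.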
104
 
 
107
 
108
  Metrics reported in the metadata were computed on Clotho v2.1 test split with captions generated using a beam search with 5 beams.
109
 
110
+ | | whisper-tiny | whisper-small | whisper-large-v2 |
111
+ |----------------------|--------------|---------------|------------------|
112
+ | SacreBLEU | 13.77 | 15.76 | 16.50 |
113
+ | METEOR | 0.3452 | 0.3781 | 0.3782 |
114
+ | CIDEr | 0.3404 | 0.4142 | 0.4331 |
115
+ | SPICE | 0.1077 | 0.1234 | 0.1257 |
116
+ | SPIDEr | 0.2240 | 0.2687 | 0.2794 |
117
+
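
As a quick sanity check on the table above: SPIDEr is defined as the arithmetic mean of CIDEr and SPICE, so the reported SPIDEr column should be recoverable from the other two up to rounding. The sketch below uses only the values from the table; the `scores` dictionary, the `spider` helper, and the tolerance are illustrative choices, not part of the evaluation code.

```python
# Values copied from the table: model -> (CIDEr, SPICE, reported SPIDEr).
scores = {
    "whisper-tiny": (0.3404, 0.1077, 0.2240),
    "whisper-small": (0.4142, 0.1234, 0.2687),
    "whisper-large-v2": (0.4331, 0.1257, 0.2794),
}

# Inputs are rounded to 4 decimals, so allow a small gap (illustrative value).
TOLERANCE = 5e-4


def spider(cider, spice):
    """SPIDEr is the mean of CIDEr and SPICE."""
    return (cider + spice) / 2


def check_table(table, tolerance):
    """Return the absolute gap between recomputed and reported SPIDEr per model."""
    gaps = {}
    for name, (cider, spice, reported) in table.items():
        gaps[name] = abs(spider(cider, spice) - reported)
    return gaps


if __name__ == "__main__":
    for name, gap in check_table(scores, TOLERANCE).items():
        status = "ok" if gap <= TOLERANCE else "mismatch"
        print(f"{name}: gap={gap:.5f} ({status})")
```

Small gaps are expected because the table rounds each metric independently.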
118
 
119
  ## Limitations
120