DionTimmer committed
Commit 8680be5
1 Parent(s): 69943db

Update README.md

Files changed (1)
  1. README.md +75 -78
README.md CHANGED
---
license: cc-by-nc-4.0
language:
- en
library_name: transformers
---
# Whisper Multitask Analyzer

A transformer encoder-decoder model for automatic audio captioning. Unlike speech-to-text, captioning describes the content and characteristics of audio clips rather than transcribing speech.

- **Model, codebase & card adapted from:** MU-NLPC/whisper-small-audio-captioning
- **Model type:** Whisper encoder-decoder transformer
- **Language(s) (NLP):** en
- **License:** cc-by-nc-4.0
- **Parent Model:** openai/whisper-small

## Usage

The model expects an audio clip (up to 30 s) as input to the encoder and information about the caption style as a forced prefix to the decoder.
The forced prefix is an integer task ID. The mapping between IDs and tasks is defined in the model config and can be retrieved through the model's `task_mapping` and `named_task_mapping` attributes, as shown below.

The task mapping of the current model is:

| Task     | ID | Description                                            |
| -------- | -- | ------------------------------------------------------ |
| tags     | 0  | General descriptions; can include genres and features. |
| genre    | 1  | Estimated musical genres.                              |
| mood     | 2  | Estimated emotional feeling.                           |
| movement | 3  | Estimated audio pace and expression.                   |
| theme    | 4  | Estimated audio usage (not very accurate).             |
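Concretely, the forced prefix is just the task ID followed by a colon and a space; a tiny sketch of what the decoder is forced to start with (the full pipeline follows in the minimal example below):

```python
# Decoder prefix selecting the "mood" task (ID 2 in the table above).
style_prefix = "2: "
# Equivalently, via the model's own mapping (requires a loaded model, as below):
# style_prefix = f"{model.named_task_mapping['mood']}: "
```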

Minimal example:
```python
import librosa
import transformers

# The custom WhisperForAudioCaptioning class ships with the model repository
# (see the note below this example); the module name here is a placeholder
# for wherever you saved that file.
from whisper_for_audio_captioning import WhisperForAudioCaptioning

# Load model, tokenizer, and feature extractor
checkpoint = "DionTimmer/whisper-small-multitask-analyzer"
model = WhisperForAudioCaptioning.from_pretrained(checkpoint)
tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(checkpoint)

# Load and preprocess audio, resampled to the rate the extractor expects
input_file = "..."
audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Mapping by ID
print(model.task_mapping)        # {0: 'tags', 1: 'genre', 2: 'mood', 3: 'movement', 4: 'theme'}

# Inverted mapping, by name
print(model.named_task_mapping)  # {'tags': 0, 'genre': 1, 'mood': 2, 'movement': 3, 'theme': 4}

# Prepare the caption style prefix ("0: " selects the tags task)
style_prefix = f"{model.named_task_mapping['tags']}: "
style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels

# Generate caption
model.eval()
outputs = model.generate(
    inputs=features.to(model.device),
    forced_ac_decoder_ids=style_prefix_tokens,
    max_length=100,
)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Example output:
*0: advertising, beautiful, beauty, bright, cinematic, commercial, corporate, emotional, epic, film, heroic, hopeful, inspiration, inspirational, inspiring, love, love story, movie, orchestra, orchestral, piano, positive, presentation, romantic, sentimental*
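Because the task is selected entirely by the forced prefix, the same encoded features can be captioned under every task in one pass over the mapping. A minimal sketch reusing `model`, `tokenizer`, and `features` from the example above:

```python
# Generate one caption per task by swapping the forced decoder prefix.
for task_id, task_name in model.task_mapping.items():
    prefix = f"{task_id}: "
    prefix_tokens = tokenizer(
        "", text_target=prefix, return_tensors="pt", add_special_tokens=False
    ).labels
    outputs = model.generate(
        inputs=features.to(model.device),
        forced_ac_decoder_ids=prefix_tokens,
        max_length=100,
    )
    print(task_name, "->", tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```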

WhisperTokenizer must be initialized with `language="en"` and `task="transcribe"`.

The model class `WhisperForAudioCaptioning` can be found in the git repository or here on the HuggingFace Hub in the model repository. The class overrides the default Whisper `generate` method to support forcing the decoder prefix.
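One way to obtain the class is to download its file straight from the Hub; a sketch assuming the modeling file is named `whisper_for_audio_captioning.py` (check the repository's file list for the actual name):

```python
import importlib.util

from huggingface_hub import hf_hub_download

# Download the modeling file from the model repo; the filename here is an
# assumption, so adjust it to the actual file in the repository.
path = hf_hub_download(
    repo_id="DionTimmer/whisper-small-multitask-analyzer",
    filename="whisper_for_audio_captioning.py",
)

# Import WhisperForAudioCaptioning from the downloaded file.
spec = importlib.util.spec_from_file_location("whisper_ac", path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
WhisperForAudioCaptioning = module.WhisperForAudioCaptioning
```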

## Licence

The model weights are published under the non-commercial license CC BY-NC 4.0, as the model was fine-tuned on a dataset restricted to non-commercial use.