---
license: cc-by-nc-4.0
language:
- en
library_name: transformers
---

# Whisper Multitask Analyzer

A transformer encoder-decoder model for automatic audio captioning. As opposed to speech-to-text, captioning describes the content and features of audio clips.

- **Model, codebase & card adapted from:** MU-NLPC/whisper-small-audio-captioning
- **Model type:** Whisper encoder-decoder transformer
- **Language(s) (NLP):** en
- **License:** cc-by-nc-4.0
- **Parent Model:** openai/whisper-small

## Usage

The model expects an audio clip (up to 30 s) as input to the encoder and information about the caption style as a forced prefix to the decoder. The forced prefix is an integer that is mapped to a task. This mapping is defined in the model config and can be retrieved from the loaded model via `model.task_mapping` (see the example below). The task mapping of the current model is:

| Task     | ID | Description                                             |
| -------- | -- | ------------------------------------------------------- |
| tags     | 0  | General descriptions; can include genres and features.  |
| genre    | 1  | Estimated musical genres.                               |
| mood     | 2  | Estimated emotional feeling.                            |
| movement | 3  | Estimated audio pace and expression.                    |
| theme    | 4  | Estimated audio usage (not very accurate).              |

Minimal example:

```python
import librosa
import transformers

# Note: WhisperForAudioCaptioning is not part of the transformers package; import it
# from the class definition shipped in this model's repository (see the note below).

# Load model
checkpoint = "DionTimmer/whisper-small-multitask-analyzer"
model = WhisperForAudioCaptioning.from_pretrained(checkpoint)
tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(checkpoint)

# Load and preprocess audio
input_file = "..."
audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Mappings by ID
print(model.task_mapping)  # {0: 'tags', 1: 'genre', 2: 'mood', 3: 'movement', 4: 'theme'}

# Inverted
print(model.named_task_mapping)  # {'tags': 0, 'genre': 1, 'mood': 2, 'movement': 3, 'theme': 4}

# Prepare caption style
style_prefix = f"{model.named_task_mapping['tags']}: "
style_prefix_tokens = tokenizer("", text_target=style_prefix, return_tensors="pt", add_special_tokens=False).labels

# Generate caption
model.eval()
outputs = model.generate(
    inputs=features.to(model.device),
    forced_ac_decoder_ids=style_prefix_tokens,
    max_length=100,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Example output:

*0: advertising, beautiful, beauty, bright, cinematic, commercial, corporate, emotional, epic, film, heroic, hopeful, inspiration, inspirational, inspiring, love, love story, movie, orchestra, orchestral, piano, positive, presentation, romantic, sentimental*

The `WhisperTokenizer` must be initialized with `language="en"` and `task="transcribe"`.

The model class `WhisperForAudioCaptioning` can be found in the git repository or here on the HuggingFace Hub in the model repository. The class overrides the default Whisper `generate` method to support forcing the decoder prefix. A short sketch that runs all five tasks on a single clip is included at the end of this card.

## Licence

The model weights are published under the non-commercial licence CC BY-NC 4.0, as the model was fine-tuned on a dataset restricted to non-commercial use.
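
For convenience, here is a short sketch (not part of the original usage example) that generates one caption per task for the same clip by looping over `model.task_mapping`. It assumes `model`, `tokenizer`, and `features` have already been created exactly as in the minimal example above.

```python
# Sketch: run every task from model.task_mapping on the same audio clip.
# Assumes `model`, `tokenizer`, and `features` exist as in the Usage example.
model.eval()
for task_id, task_name in model.task_mapping.items():
    # Each task is selected by forcing its integer ID as the decoder prefix.
    style_prefix = f"{task_id}: "
    style_prefix_tokens = tokenizer(
        "", text_target=style_prefix, return_tensors="pt", add_special_tokens=False
    ).labels

    outputs = model.generate(
        inputs=features.to(model.device),
        forced_ac_decoder_ids=style_prefix_tokens,
        max_length=100,
    )
    caption = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    print(f"{task_name}: {caption}")
```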