Update README.md
README.md CHANGED
@@ -11,8 +11,10 @@ inference: false
**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced+unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file; the code snippet below includes resampling and padding/cropping.

The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two methods can also be used to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
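
For reference, here is a minimal sketch of that pooling, assuming frame-level embeddings shaped (batch, channels, time, freq) as in the torch.Size([1, 768, 31, 7]) example further below. It only illustrates the pooling order; combining the two time poolings by summation is an assumption here, not necessarily the library's own method:

```python
import torch

# Hypothetical frame-level embeddings: (batch, channels, time, freq)
frame_emb = torch.randn(1, 768, 31, 7)

# Mean pooling over the frequency dimension
x = frame_emb.mean(dim=3)                  # -> (1, 768, 31)

# Mean pooling + max pooling over the time dimension
scene_emb = x.mean(dim=2) + x.amax(dim=2)  # -> (1, 768)

print(scene_emb.shape)  # torch.Size([1, 768])
```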
@@ -29,13 +31,15 @@ pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).

```python
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags
```
@@ -66,13 +70,28 @@ Output:

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to 32000 Hz" % sample_rate_)
    waveform = TAF.resample(waveform, sample_rate_, sample_rate)

if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)
print("\nInference on " + AUDIO_FNAME + "\n")
```
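
As a side note, the resampling step above can also be written with torchaudio's reusable transform; a minimal equivalent sketch:

```python
import torchaudio.transforms as T

# Equivalent to TAF.resample(waveform, sample_rate_, sample_rate)
resampler = T.Resample(orig_freq=sample_rate_, new_freq=sample_rate)
waveform = resampler(waveform)
```

The transform precomputes the resampling kernel at construction, which helps when processing many files that share the same source rate.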
@@ -101,19 +120,24 @@ for l in sample_labels:
Output:
```
Inference on 254906__tpellegrini__cavaco1.wav

Resampling from 44100 to 32000 Hz
Padding waveform
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```

Technically, it's neither a Mandolin nor a Ukulele, but the Ukulele's Brazilian cousin, the cavaquinho!
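
For context, the thresholded predictions above can be reproduced along these lines. This is a hedged sketch: it assumes probs is the (1, 527) probability tensor from the model and that read_audioset_label_tags provides an index-to-label mapping; the name ix_to_lb is illustrative:

```python
threshold = 0.25
probs_np = probs[0].detach().cpu().numpy()         # shape (527,)
sample_labels = np.where(probs_np > threshold)[0]  # e.g. [137 138 139 140 149 151]
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs_np[l]))
```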
## Get audio scene embeddings
@@ -148,10 +172,6 @@ Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
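
If you want to fetch that checkpoint directly from Python, one option is torch.hub; note that the layout of the downloaded checkpoint (raw state dict vs. wrapper dict) is an assumption to verify against the repository:

```python
import torch

ckpt_url = "https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1"
ckpt = torch.hub.load_state_dict_from_url(
    ckpt_url, map_location="cpu", file_name="convnext_tiny_471mAP.pth"
)
# Inspect the keys before loading into the model; the checkpoint may be a raw
# state dict or a wrapper such as {"model": state_dict}.
print(list(ckpt.keys())[:5])
```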
# Citation