Update README.md
README.md CHANGED
@@ -11,8 +11,10 @@ inference: false
**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced+unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file; the code snippet below includes resampling and padding/cropping.

The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two methods can also be used to get scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
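
For reference, here is a minimal sketch of that pooling, assuming frame-level embeddings shaped (batch, channels, time, freq) as in the torch.Size([1, 768, 31, 7]) example further below. It only illustrates the pooling order; combining the two time poolings by summation is an assumption here, not necessarily the library's own method:

```python
import torch

# Hypothetical frame-level embeddings: (batch, channels, time, freq)
frame_emb = torch.randn(1, 768, 31, 7)

# Mean pooling over the frequency dimension
x = frame_emb.mean(dim=3)                  # -> (1, 768, 31)

# Mean pooling + max pooling over the time dimension
scene_emb = x.mean(dim=2) + x.amax(dim=2)  # -> (1, 768)

print(scene_emb.shape)  # torch.Size([1, 768])
```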
@@ -29,13 +31,15 @@ pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).

```python
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags
```
@@ -66,13 +70,28 @@ Output:

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to 32000 Hz" % sample_rate_)
    waveform = TAF.resample(waveform, sample_rate_, sample_rate)

if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)
print("\nInference on " + AUDIO_FNAME + "\n")
```
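
As a side note, the resampling step above can also be written with torchaudio's reusable transform; a minimal equivalent sketch:

```python
import torchaudio.transforms as T

# Equivalent to TAF.resample(waveform, sample_rate_, sample_rate)
resampler = T.Resample(orig_freq=sample_rate_, new_freq=sample_rate)
waveform = resampler(waveform)
```

The transform precomputes the resampling kernel at construction, which helps when processing many files that share the same source rate.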
@@ -101,19 +120,24 @@ for l in sample_labels:
Output:
```
Inference on 254906__tpellegrini__cavaco1.wav

Resampling from 44100 to 32000 Hz
Padding waveform
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```

Technically, it's neither a Mandolin nor a Ukulele, but the Ukulele's Brazilian cousin, the cavaquinho!
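
For context, the thresholded predictions above can be reproduced along these lines. This is a hedged sketch: it assumes probs is the (1, 527) probability tensor from the model and that read_audioset_label_tags provides an index-to-label mapping; the name ix_to_lb is illustrative:

```python
threshold = 0.25
probs_np = probs[0].detach().cpu().numpy()         # shape (527,)
sample_labels = np.where(probs_np > threshold)[0]  # e.g. [137 138 139 140 149 151]
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs_np[l]))
```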
## Get audio scene embeddings
@@ -148,10 +172,6 @@ Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
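
If you want to fetch that checkpoint directly from Python, one option is torch.hub; note that the layout of the downloaded checkpoint (raw state dict vs. wrapper dict) is an assumption to verify against the repository:

```python
import torch

ckpt_url = "https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1"
ckpt = torch.hub.load_state_dict_from_url(
    ckpt_url, map_location="cpu", file_name="convnext_tiny_471mAP.pth"
)
# Inspect the keys before loading into the model; the checkpoint may be a raw
# state dict or a wrapper such as {"model": state_dict}.
print(list(ckpt.keys())[:5])
```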
# Citation