---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---

**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced+unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).

The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file: the snippet in the Usage section includes resampling and padding/cropping.
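As a minimal sketch, the padding/cropping step can be isolated in a small helper. The names `to_mono` and `fix_length` are hypothetical and not part of the `audioset_convnext_inf` package. The mono downmix is an assumption: the usage example passes the `(channels, samples)` tensor from `torchaudio.load` directly to the model and gets a `[1, 527]` output for a mono file, which suggests the first dimension is treated as a batch, so a stereo file would otherwise yield one prediction per channel.

```python
import torch
from torch.nn import functional as F


def to_mono(waveform: torch.Tensor) -> torch.Tensor:
    """Average the channels of a (channels, samples) tensor into (1, samples)."""
    return waveform.mean(dim=0, keepdim=True)


def fix_length(waveform: torch.Tensor, target_length: int) -> torch.Tensor:
    """Zero-pad at the end or crop so the last dimension equals target_length."""
    n_samples = waveform.shape[-1]
    if n_samples < target_length:
        missing = target_length - n_samples
        return F.pad(waveform, (0, missing), mode="constant", value=0.0)
    return waveform[..., :target_length]


sample_rate = 32000
target_length = 10 * sample_rate  # 10 s, as in training

# Random stand-ins for real recordings: a 3 s stereo clip and a 12 s mono clip.
short_stereo = torch.randn(2, 3 * sample_rate)
long_mono = torch.randn(1, 12 * sample_rate)

padded = fix_length(to_mono(short_stereo), target_length)
cropped = fix_length(long_mono, target_length)
print(padded.shape, cropped.shape)
```

Both results have shape `(1, 320000)`, the input size used in the usage example.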

The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).

Two additional methods return scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is computed from the frame-level embeddings by mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
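The pooling described above can be sketched as follows. This is an illustration under stated assumptions, not the package's own code: it assumes the frame-level tensor is laid out as `(batch, channels, time, frequency)`, which matches the `[1, 768, 31, 7]` shape reported further down, and it reads "mean pooling + max pooling" as the sum of the two pooled vectors. The function name `scene_from_frames` is hypothetical.

```python
import torch


def scene_from_frames(frame_emb: torch.Tensor) -> torch.Tensor:
    """Pool (batch, channels, time, freq) embeddings into (batch, channels)."""
    x = frame_emb.mean(dim=3)      # mean over frequency -> (batch, channels, time)
    x_max = x.max(dim=2).values    # max over time -> (batch, channels)
    x_mean = x.mean(dim=2)         # mean over time -> (batch, channels)
    return x_max + x_mean          # assumption: the two poolings are summed


# Random stand-in with the frame-level shape reported for a 10 s clip.
frames = torch.randn(1, 768, 31, 7)
scene = scene_from_frames(frames)
print(scene.shape)
```

The result has one 768-dimensional vector per file, the same shape as the scene embedding shown in the usage section.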


# Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

You can pip install it:

```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```

# Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels). 

```python
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags

model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

if "cuda" in str(device):
    model = model.to(device)
```

Output:
```
# params: 28222767
```

## Inference: get logits and probabilities

```python
sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)

waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to 32000 Hz"%sample_rate_)
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
        )

if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = max(audio_target_length - waveform.shape[-1], 0)
    waveform = TF.pad(waveform, (0,missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length: 
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

# class_labels_indices.csv must be present in the current working directory
current_dir = os.getcwd()
lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(
    os.path.join(current_dir, "class_labels_indices.csv")
)

threshold = 0.25
# Indices of the tags whose probability exceeds the threshold
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs[0, l]))
```

Output:
```
Resampling from 44100 to 32000 Hz
Padding waveform

Inference on 254906__tpellegrini__cavaco1.wav

logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```

Technically, the instrument is neither a mandolin nor a ukulele but the ukulele's Brazilian cousin, the cavaquinho!
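A fixed threshold can return many tags or none, depending on the clip. As an alternative sketch, the k most probable tags can be listed with `torch.topk`. The helper `top_k_tags` is hypothetical, and the toy probabilities and label mapping below are stand-ins for `probs[0].cpu()` and `ix_to_lb` from the example above.

```python
import torch


def top_k_tags(probs: torch.Tensor, ix_to_lb, k: int = 5):
    """Return the k most probable (label, probability) pairs, best first."""
    values, indices = torch.topk(probs, k)
    return [(ix_to_lb[i], v) for v, i in zip(values.tolist(), indices.tolist())]


# Toy stand-ins: a real call would pass probs[0].cpu() and ix_to_lb.
toy_probs = torch.tensor([0.05, 0.90, 0.30, 0.70])
toy_labels = {0: "Speech", 1: "Music", 2: "Guitar", 3: "Mandolin"}

top = top_k_tags(toy_probs, toy_labels, k=2)
for label, prob in top:
    print("%s: %.3f" % (label, prob))
```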


## Get audio scene embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())
```

Output:
```
Scene embedding, shape: torch.Size([1, 768])
```

## Get frame-level embeddings
```python
with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())
```

Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```

# Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1


# Citation

[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

```bibtex
@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}
```