---
license: mit
tags:
- audio tagging
- audio events
- audio embeddings
- convnext-audio
- audioset
inference: false
---
**ConvNeXt-Tiny-AT** is an audio tagging CNN model, trained on **AudioSet** (balanced and unbalanced subsets). It reached 0.471 mAP on the test set [(Paper)](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html).
The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file: resampling and padding/cropping are handled in the code snippet below.
The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).
Two methods are also provided to extract scene embeddings (a single vector per file) and frame-level embeddings; see below.
The scene embedding is obtained from the frame-level embeddings by applying mean pooling over the frequency dimension, followed by mean pooling combined with max pooling over the time dimension.
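To make that pooling concrete, here is a minimal sketch, assuming the frame-level tensor is laid out as (batch, channels, time, frequency) and that the mean- and max-pooled time vectors are summed (both are assumptions; use `forward_scene_embeddings` below for the actual embedding):
```python
import torch

# Placeholder frame-level embeddings: (batch, channels, time, frequency)
frame_embeddings = torch.randn(1, 768, 31, 7)

x = frame_embeddings.mean(dim=3)                 # mean pooling over frequency -> [1, 768, 31]
scene_embedding = x.mean(dim=2) + x.amax(dim=2)  # mean + max pooling over time -> [1, 768]
print(scene_embedding.shape)  # torch.Size([1, 768])
```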
# Install
This code is based on our repo: https://github.com/topel/audioset-convnext-inf
You can pip install it:
```bash
pip install git+https://github.com/topel/audioset-convnext-inf@pip-install
```
# Usage
Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).
```python
import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF
from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags
model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')
print(
"# params:",
sum(param.numel() for param in model.parameters() if param.requires_grad),
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```
Output:
```
# params: 28222767
```
## Inference: get logits and probabilities
```python
sample_rate = 32000
audio_target_length = 10 * sample_rate # 10 s
# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"
AUDIO_FPATH = os.path.join("/path/to/audio", AUDIO_FNAME)
waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to %d Hz" % (sample_rate_, sample_rate))
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )
if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = audio_target_length - waveform.shape[-1]
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]
waveform = waveform.contiguous()
waveform = waveform.to(device)
print("\nInference on " + AUDIO_FNAME + "\n")
with torch.no_grad():
model.eval()
output = model(waveform)
logits = output["clipwise_logits"]
print("logits size:", logits.size())
probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())
current_dir = os.getcwd()
# class_labels_indices.csv is the AudioSet label map, expected in the current directory
lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))
threshold = 0.25
sample_labels = np.where(probs[0].clone().detach().cpu() > threshold)[0]
print("\nPredicted labels using activity threshold 0.25:\n")
print(sample_labels)
for l in sample_labels:
    print("%s: %.3f" % (ix_to_lb[l], probs[0, l]))
```
Output:
```
Inference on 254906__tpellegrini__cavaco1.wav
Resampling from 44100 to 32000 Hz
Padding waveform
logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])
Predicted labels using activity threshold 0.25:
[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268
```
Technically, it is neither a mandolin nor a ukulele, but the ukulele's Brazilian cousin, the cavaquinho!
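If a fixed activity threshold does not suit your audio, a top-k readout is a simple alternative. A minimal sketch reusing `probs` and `ix_to_lb` from the snippet above (`k = 5` is an arbitrary choice):
```python
k = 5
topk_probs, topk_indices = torch.topk(probs[0], k)
for p, ix in zip(topk_probs.cpu(), topk_indices.cpu()):
    print("%s: %.3f" % (ix_to_lb[ix.item()], p.item()))
```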
## Get audio scene embeddings
```python
with torch.no_grad():
model.eval()
output = model.forward_scene_embeddings(waveform)
print("\nScene embedding, shape:", output.size())
```
Output:
```
Scene embedding, shape: torch.Size([1, 768])
```
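Scene embeddings can be used, for example, to compare recordings. A minimal sketch, where `waveform2` is a hypothetical second clip preprocessed exactly like `waveform` above:
```python
with torch.no_grad():
    model.eval()
    emb1 = model.forward_scene_embeddings(waveform)
    emb2 = model.forward_scene_embeddings(waveform2)  # hypothetical second clip
# Cosine similarity between the two 768-d scene embeddings
similarity = TF.cosine_similarity(emb1, emb2)
print("Cosine similarity:", similarity.item())
```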
## Get frame-level embeddings
```python
with torch.no_grad():
model.eval()
output = model.forward_frame_embeddings(waveform)
print("\nFrame-level embeddings, shape:", output.size())
```
Output:
```
Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
```
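The trailing axes are presumably time and frequency (31 frames × 7 frequency bins for a 10-second input); this layout is an assumption inferred from the pooling description above, not something stated explicitly here. Under that assumption, averaging over the frequency axis yields one 768-dimensional vector per time frame:
```python
# Assumed layout: (batch, channels, time, frequency)
per_frame = output.mean(dim=3)  # -> torch.Size([1, 768, 31])
print("Per-frame embeddings, shape:", per_frame.size())
```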
# Zenodo
The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1
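If you prefer the Zenodo file over `from_pretrained`, it can be fetched with PyTorch's built-in downloader (the local filename is arbitrary; inspect the checkpoint contents before loading them into a model instance):
```python
import torch

url = "https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1"
torch.hub.download_url_to_file(url, "convnext_tiny_471mAP.pth")
checkpoint = torch.load("convnext_tiny_471mAP.pth", map_location="cpu")
print(type(checkpoint))  # inspect structure before use
```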
# Citation
[Paper available](https://www.isca-speech.org/archive/interspeech_2023/pellegrini23_interspeech.html)
Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564
```bibtex
@inproceedings{pellegrini23_interspeech,
author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={4169--4173},
doi={10.21437/Interspeech.2023-1564}
}
```