---
license: apache-2.0
datasets:
- AudioSet
metrics:
- mAP
pipeline_tag: audio-classification
---

# CED-Mini Model
CED is a family of simple ViT-based models for audio tagging that achieve state-of-the-art performance on AudioSet.


| Model | Parameters (M) | AS-20K (mAP) | AS-2M (mAP) |
|------|-------|-------|-------|
| CED-Tiny | 5.5   | 36.5  | 48.1  |
| CED-Mini | 9.6    | 38.5  | 49.0  |
| CED-Small| 22    | 41.6  | 49.6  |
| CED-Base | 86    | 44.0  | 50.0  |


Notable differences from other available models include:
1. Simplified fine-tuning: Mel spectrograms are batch-normalized inside the model, so there is no need to precompute dataset-wide mean/variance statistics, as is common for AST.
2. Support for variable-length inputs. Most other models use a static time-frequency positional embedding, which hinders generalization to segments shorter than 10 s. Many previous transformers simply pad their input to 10 s to avoid the performance drop, which in turn slows down training and inference drastically.
3. Training/inference speedup: 64-dimensional mel filterbanks and non-overlapping 16x16 patches yield 248 patches from a 10 s spectrogram. In comparison, AST uses 128 mel filterbanks with overlapping 16x16 patches (stride 10), leading to 1212 patches during training/inference. CED-Tiny runs on a common CPU as fast as a comparable MobileNetV3.
4. Performance: CED with 10M parameters outperforms the majority of previous approaches (~80M parameters).
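The patch counts above can be verified with simple arithmetic. The sketch below is illustrative only: the frame counts (a 10 ms hop giving ~1000 frames for CED, and AST padding to 1024 frames) are assumptions, not values read from either model's config.

```python
def n_patches(n_mels, n_frames, patch, stride):
    """Count patches when sliding a patch x patch window with the given stride."""
    freq = (n_mels - patch) // stride + 1
    time = (n_frames - patch) // stride + 1
    return freq * time

# CED: 64 mel bins, ~10 s at an assumed 10 ms hop -> ~1000 frames, no overlap
ced = n_patches(64, 1000, patch=16, stride=16)    # 4 * 62 = 248 patches

# AST: 128 mel bins, assumed 1024 frames, 16x16 patches at stride 10 (overlapping)
ast = n_patches(128, 1024, patch=16, stride=10)   # 12 * 101 = 1212 patches

print(ced, ast)
```

The roughly 5x reduction in sequence length is where most of the speedup comes from, since self-attention cost grows quadratically with the number of patches.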

### Model Sources
- **Original Repository:** https://github.com/RicherMans/CED
- **Repository:** https://github.com/jimbozhang/hf_transformers_custom_model_ced
- **Paper:** [CED: Consistent ensemble distillation for audio tagging](https://arxiv.org/abs/2308.11957)
- **Demo:** https://huggingface.co/spaces/mispeech/ced-base

## Install
```bash
pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_ced.git
```

## Inference
```python
>>> from ced_model.feature_extraction_ced import CedFeatureExtractor
>>> from ced_model.modeling_ced import CedForAudioClassification

>>> model_name = "mispeech/ced-mini"
>>> feature_extractor = CedFeatureExtractor.from_pretrained(model_name)
>>> model = CedForAudioClassification.from_pretrained(model_name)

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")

>>> import torch
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_id = torch.argmax(logits, dim=-1).item()
>>> model.config.id2label[predicted_class_id]
'Finger snapping'
```
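Since AudioSet is a multi-label dataset, it is common to score each class independently with a sigmoid and take the top-k, rather than a single argmax. The sketch below uses a dummy logits tensor and label map standing in for `model(**inputs).logits` and `model.config.id2label` from the snippet above, so it runs without downloading the model.

```python
import torch

# Dummy stand-ins for model outputs (assumption: real shapes are [1, num_classes])
logits = torch.tensor([[0.2, 3.1, -1.0, 1.7]])
id2label = {0: "Speech", 1: "Finger snapping", 2: "Silence", 3: "Music"}

# Multi-label scoring: independent sigmoid per class, then top-k
probs = torch.sigmoid(logits)[0]
topk = torch.topk(probs, k=2)
for score, idx in zip(topk.values.tolist(), topk.indices.tolist()):
    print(f"{id2label[idx]}: {score:.3f}")
```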

## Fine-tuning
[`example_finetune_esc50.ipynb`](https://github.com/jimbozhang/hf_transformers_custom_model_ced/blob/main/example_finetune_esc50.ipynb) demonstrates how to train a linear head on the ESC-50 dataset with the CED encoder frozen.
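The frozen-encoder setup can be sketched as a linear probe in plain PyTorch. The `embed_dim=256` value and the stand-in encoder below are assumptions for illustration; in the notebook the backbone comes from the pretrained CED model.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Train only a linear head on top of a frozen feature encoder."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # keep the backbone frozen
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)          # clip-level embedding
        return self.head(feats)

# Stand-in encoder so the sketch runs without downloading weights (assumption)
encoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU())
probe = LinearProbe(encoder, embed_dim=256, num_classes=50)  # ESC-50 has 50 classes

trainable = sum(p.numel() for p in probe.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # just the head: 256*50 + 50
```

Only `probe.head` receives gradients, so fine-tuning is fast and needs far less labeled data than training the full model.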