---
license: apache-2.0
tags:
- contrastive learning
- CLAP
- audio classification
- zero-shot classification
---
[![arXiv](https://img.shields.io/badge/10.21437%2FInterspeech.2024-red?label=paper-pdf)](https://www.isca-archive.org/interspeech_2024/paissan24_interspeech.pdf)
# tinyCLAP: Distilling Contrastive Language-Audio Pretrained models
This repository contains the official implementation of [tinyCLAP](https://www.isca-archive.org/interspeech_2024/paissan24_interspeech.html).
To access the project website, use [this link](https://francescopaissan.it/tinyclapweb/).
![tinyCLAP overview](https://francescopaissan.it/tinyclapweb/assets/overview.png)
## Requirements
To clone the repo and install requirements:
```bash
git clone https://github.com/fpaissan/tinyCLAP && cd tinyCLAP
pip install -r extra_requirements.txt
```
## Training
To train the model(s) in the paper, run this command:
```bash
MODEL_NAME=phinet_alpha_1.50_beta_0.75_t0_6_N_7
./run_tinyCLAP.sh $MODEL_NAME
```
Note that `MODEL_NAME` encodes the student configuration: the script parses the PhiNet hyperparameters directly from the name, so you can change them by editing the name itself (see the sketch after the notes below).
Please note:
- To use the original CLAP encoder in the distillation setting, replace the model name with `Cnn14`;
- To reproduce the variants of PhiNet from the manuscript, refer to the hyperparameters listed in Table 1.
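For instance (the alternative hyperparameter values below are illustrative, not variants reported in the paper):
```bash
# The name encodes the PhiNet hyperparameters (alpha, beta, t0, N),
# which run_tinyCLAP.sh parses automatically.
./run_tinyCLAP.sh phinet_alpha_1.50_beta_0.75_t0_6_N_7  # student used in the paper
./run_tinyCLAP.sh phinet_alpha_0.75_beta_0.75_t0_6_N_7  # illustrative: smaller alpha
./run_tinyCLAP.sh Cnn14  # use the original CLAP audio encoder (Cnn14) as the student
```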
## Evaluation
The evaluation command varies slightly across datasets; the dataset-specific commands are listed below.
### ESC50
```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --esc_folder $PATH_TO_ESC
```
### UrbanSound8K
```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --us8k_folder $PATH_TO_US8K
```
### TUT17
```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --tut17_folder $PATH_TO_TUT17
```
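If all three dataset folders are available, the evaluations can also be run back to back. A minimal sketch (the `$PATH_TO_*` variables are placeholders for your local dataset paths):
```bash
# Minimal sketch: loop over the three zero-shot evaluations.
# $MODEL_NAME and the $PATH_TO_* variables must be set beforehand.
for flag in "--esc_folder $PATH_TO_ESC" \
            "--us8k_folder $PATH_TO_US8K" \
            "--tut17_folder $PATH_TO_TUT17"; do
  python train_clap.py hparams/distill_clap.yaml \
    --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True $flag
done
```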
## Pre-trained Models
You can download pretrained models from the [tinyCLAP HF](https://huggingface.co/fpaissan/tinyCLAP).
_Note_: The checkpoints on HF contain the entire CLAP module, including the text encoder and the teacher audio encoder.
To run inference using the pretrained models, please use:
```bash
python train_clap.py hparams/distill_clap.yaml --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME.ckpt --zs_eval True --tut17_folder $PATH_TO_TUT17
```
This command automatically downloads the checkpoint if it is available in the zoo of pretrained models. Make sure to pass the dataset flag and folder that match the dataset you want to evaluate, as in the example below.
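For example, zero-shot ESC-50 evaluation with the distilled PhiNet student (the dataset path is a placeholder):
```bash
MODEL_NAME=phinet_alpha_1.50_beta_0.75_t0_6_N_7
python train_clap.py hparams/distill_clap.yaml \
  --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME.ckpt \
  --zs_eval True --esc_folder $PATH_TO_ESC
```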
The available models, with their parameter counts and zero-shot accuracies, are listed in the following table:
| Student audio encoder (`audioenc_name_student`) | Params [M] | ESC-50 | UrbanSound8K | TUT17 |
|:-----:|:----------:|:------:|:------------:|:-----:|
| MSFT CLAP | 82.8 | 80.7% | 72.1% | 25.2% |
| Cnn14 | 82.8 | 81.3% | 72.3% | 23.7% |
| phinet_alpha_1.50_beta_0.75_t0_6_N_7 | 4.4 | 77.3% | 69.7% | 21.9% |
The original paper's checkpoints are available through [this link](https://www.dropbox.com/scl/fi/e3aj76vxwlb4w6hs3mclv/tinyCLAP_results.zip?rlkey=7fl426tz1vf686oyosvja8i9s&dl=0).
## Citing tinyCLAP
```bibtex
@inproceedings{paissan24_interspeech,
title = {tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models},
author = {Francesco Paissan and Elisabetta Farella},
year = {2024},
booktitle = {Interspeech 2024},
pages = {1685--1689},
doi = {10.21437/Interspeech.2024-193},
issn = {2958-1796},
}
```