---
license: apache-2.0
tags:
- contrastive learning
- CLAP
- audio classification
- zero-shot classification
---
|

[![arXiv](https://img.shields.io/badge/10.21437%2FInterspeech.2024-red?label=paper-pdf)](https://www.isca-archive.org/interspeech_2024/paissan24_interspeech.pdf)

# tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models

This repository contains the official implementation of [tinyCLAP](https://www.isca-archive.org/interspeech_2024/paissan24_interspeech.html).
To access the project website, use [this link](https://francescopaissan.it/tinyclapweb/).

![tinyCLAP overview](https://francescopaissan.it/tinyclapweb/assets/overview.png)
|

## Requirements

To clone the repo and install the requirements:

```setup
git clone https://github.com/fpaissan/tinyCLAP && cd tinyCLAP
pip install -r extra_requirements.txt
```
|

## Training

To train the model(s) in the paper, run this command:

```bash
MODEL_NAME=phinet_alpha_1.50_beta_0.75_t0_6_N_7

./run_tinyCLAP.sh $MODEL_NAME
```
|

Note that `MODEL_NAME` is formatted so that the script automatically parses the student model configuration from it; you can change the student's hyperparameters by editing the model name, as shown in the sketch after this list.

Please note:

- To use the original CLAP encoder in the distillation setting, replace the model name with `Cnn14`;
- To reproduce the PhiNet variants from the manuscript, refer to the hyperparameters listed in Table 1.
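
For instance, assuming the naming convention implied by the default name (`phinet_alpha_{alpha}_beta_{beta}_t0_{t0}_N_{N}`), a smaller student could be requested as follows; the hyperparameter values below are illustrative, not taken from the paper:

```bash
# Hypothetical smaller PhiNet student: alpha=0.75, beta=0.50, t0=4, N=6.
# The run script is expected to parse these values from the name itself.
MODEL_NAME=phinet_alpha_0.75_beta_0.50_t0_4_N_6

./run_tinyCLAP.sh $MODEL_NAME
```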
|

## Evaluation

The evaluation command varies slightly across datasets.
All the necessary commands are listed below.
|

### ESC50

```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --esc_folder $PATH_TO_ESC
```
|

### UrbanSound8K

```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --us8k_folder $PATH_TO_US8K
```
|

### TUT17

```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --tut17_folder $PATH_TO_TUT17
```
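
To run all three benchmarks for a single student in one go, a minimal sketch (assuming the same flags and dataset paths as above) is:

```bash
# Zero-shot evaluation of one distilled student on all three benchmarks.
# $PATH_TO_ESC, $PATH_TO_US8K and $PATH_TO_TUT17 must point to local copies
# of the datasets; word splitting on the unquoted $flags is intentional.
for flags in "--esc_folder $PATH_TO_ESC" \
             "--us8k_folder $PATH_TO_US8K" \
             "--tut17_folder $PATH_TO_TUT17"; do
  python train_clap.py hparams/distill_clap.yaml \
    --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True $flags
done
```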
|

## Pre-trained Models

You can download pretrained models from the [tinyCLAP Hugging Face repository](https://huggingface.co/fpaissan/tinyCLAP).

_Note_: The checkpoints on HF contain the entire CLAP module (including the text encoder and the teacher audio encoder).
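
If you prefer to fetch a checkpoint manually rather than relying on the automatic download described below, a minimal sketch using the `huggingface_hub` CLI is shown here; the `.ckpt` filename layout is an assumption based on the model names in the table further down:

```bash
# Download a single tinyCLAP checkpoint from the Hugging Face Hub and
# print its local cache path (assumes "<model_name>.ckpt" at the repo root).
huggingface-cli download fpaissan/tinyCLAP $MODEL_NAME.ckpt
```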
|

To run inference using the pretrained models, please use:

```bash
python train_clap.py hparams/distill_clap.yaml --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME.ckpt --zs_eval True --tut17_folder $PATH_TO_TUT17
```
|

This command automatically downloads the checkpoint if it is present in the zoo of pretrained models. Make sure to adapt the dataset flag to the benchmark you are evaluating, as in the example below.
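
For example, evaluating the same pretrained student on ESC-50 instead of TUT17 (assuming the same flags as in the Evaluation section) becomes:

```bash
python train_clap.py hparams/distill_clap.yaml --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME.ckpt --zs_eval True --esc_folder $PATH_TO_ESC
```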
|

The available models and their computational cost are listed in the following table:

| `audioenc_name_student` | Params [M] | ESC-50 | UrbanSound8K | TUT17 |
|:-----:|:----------:|:------:|:------------:|:-----:|
| MSFT CLAP | 82.8 | 80.7% | 72.1% | 25.2% |
| Cnn14 | 82.8 | 81.3% | 72.3% | 23.7% |
| phinet_alpha_1.50_beta_0.75_t0_6_N_7 | 4.4 | 77.3% | 69.7% | 21.9% |
|

The original paper's checkpoints are available through [this link](https://www.dropbox.com/scl/fi/e3aj76vxwlb4w6hs3mclv/tinyCLAP_results.zip?rlkey=7fl426tz1vf686oyosvja8i9s&dl=0).
|

## Citing tinyCLAP

```bibtex
@inproceedings{paissan24_interspeech,
  title     = {tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models},
  author    = {Francesco Paissan and Elisabetta Farella},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {1685--1689},
  doi       = {10.21437/Interspeech.2024-193},
  issn      = {2958-1796},
}
```