---
license: apache-2.0
tags:
- contrastive learning
- CLAP
- audio classification
- zero-shot classification
---
|

[![arXiv](https://img.shields.io/badge/10.21437%2FInterspeech.2024-red?label=paper-pdf)](https://www.isca-archive.org/interspeech_2024/paissan24_interspeech.pdf)

# tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models

This repository contains the official implementation of [tinyCLAP](https://www.isca-archive.org/interspeech_2024/paissan24_interspeech.html).
To access the project website, use [this link](https://francescopaissan.it/tinyclapweb/).

![tinyCLAP overview](https://francescopaissan.it/tinyclapweb/assets/overview.png)
|

## Requirements

To clone the repo and install the requirements:

```setup
git clone https://github.com/fpaissan/tinyCLAP && cd tinyCLAP
pip install -r extra_requirements.txt
```
|

## Training

To train the model(s) in the paper, run this command:

```bash
MODEL_NAME=phinet_alpha_1.50_beta_0.75_t0_6_N_7

./run_tinyCLAP.sh $MODEL_NAME
```
|

Note that `MODEL_NAME` is formatted so that the script automatically parses the student model configuration from it; you can change the student's hyperparameters by editing the model name, as shown in the sketch after this list.

Please note:

- To use the original CLAP encoder in the distillation setting, replace the model name with `Cnn14`;
- To reproduce the PhiNet variants from the manuscript, refer to the hyperparameters listed in Table 1.
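
For instance, assuming the naming convention implied by the default name (`phinet_alpha_{alpha}_beta_{beta}_t0_{t0}_N_{N}`), a smaller student could be requested as follows; the hyperparameter values below are illustrative, not taken from the paper:

```bash
# Hypothetical smaller PhiNet student: alpha=0.75, beta=0.50, t0=4, N=6.
# The run script is expected to parse these values from the name itself.
MODEL_NAME=phinet_alpha_0.75_beta_0.50_t0_4_N_6

./run_tinyCLAP.sh $MODEL_NAME
```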
|

## Evaluation

The evaluation command varies slightly across datasets.
All the necessary commands are listed below.
|

### ESC50

```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --esc_folder $PATH_TO_ESC
```
|

### UrbanSound8K

```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --us8k_folder $PATH_TO_US8K
```
|

### TUT17

```bash
python train_clap.py hparams/distill_clap.yaml --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --tut17_folder $PATH_TO_TUT17
```
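
To run all three benchmarks for a single student in one go, a minimal sketch (assuming the same flags and dataset paths as above) is:

```bash
# Zero-shot evaluation of one distilled student on all three benchmarks.
# $PATH_TO_ESC, $PATH_TO_US8K and $PATH_TO_TUT17 must point to local copies
# of the datasets; word splitting on the unquoted $flags is intentional.
for flags in "--esc_folder $PATH_TO_ESC" \
             "--us8k_folder $PATH_TO_US8K" \
             "--tut17_folder $PATH_TO_TUT17"; do
  python train_clap.py hparams/distill_clap.yaml \
    --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True $flags
done
```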
|

## Pre-trained Models

You can download pretrained models from the [tinyCLAP Hugging Face repository](https://huggingface.co/fpaissan/tinyCLAP).

_Note_: The checkpoints on HF contain the entire CLAP module (including the text encoder and the teacher audio encoder).
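
If you prefer to fetch a checkpoint manually rather than relying on the automatic download described below, a minimal sketch using the `huggingface_hub` CLI is shown here; the `.ckpt` filename layout is an assumption based on the model names in the table further down:

```bash
# Download a single tinyCLAP checkpoint from the Hugging Face Hub and
# print its local cache path (assumes "<model_name>.ckpt" at the repo root).
huggingface-cli download fpaissan/tinyCLAP $MODEL_NAME.ckpt
```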
|

To run inference using the pretrained models, please use:

```bash
python train_clap.py hparams/distill_clap.yaml --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME.ckpt --zs_eval True --tut17_folder $PATH_TO_TUT17
```
|

This command automatically downloads the checkpoint if it is present in the zoo of pretrained models. Make sure to adapt the dataset flag to the benchmark you are evaluating, as in the example below.
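
For example, evaluating the same pretrained student on ESC-50 instead of TUT17 (assuming the same flags as in the Evaluation section) becomes:

```bash
python train_clap.py hparams/distill_clap.yaml --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME.ckpt --zs_eval True --esc_folder $PATH_TO_ESC
```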
|

The available models and their computational cost are listed in the following table:

| `audioenc_name_student` | Params [M] | ESC-50 | UrbanSound8K | TUT17 |
|:-----:|:----------:|:------:|:------------:|:-----:|
| MSFT CLAP | 82.8 | 80.7% | 72.1% | 25.2% |
| Cnn14 | 82.8 | 81.3% | 72.3% | 23.7% |
| phinet_alpha_1.50_beta_0.75_t0_6_N_7 | 4.4 | 77.3% | 69.7% | 21.9% |
|

The original paper's checkpoints are available through [this link](https://www.dropbox.com/scl/fi/e3aj76vxwlb4w6hs3mclv/tinyCLAP_results.zip?rlkey=7fl426tz1vf686oyosvja8i9s&dl=0).
|

## Citing tinyCLAP

```bibtex
@inproceedings{paissan24_interspeech,
  title     = {tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models},
  author    = {Francesco Paissan and Elisabetta Farella},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {1685--1689},
  doi       = {10.21437/Interspeech.2024-193},
  issn      = {2958-1796},
}
```