|
--- |
|
license: apache-2.0 |
|
tags: |
|
- contrastive learning |
|
- CLAP |
|
- audio classification |
|
- zero-shot classification |
|
--- |
|
|
|
# tinyCLAP: Distilling Contrastive Language-Audio Pretrained models |
|
|
|
[![arXiv](https://img.shields.io/badge/arXiv-2311.14517-b31b1b.svg)](https://arxiv.org/abs/2311.14517)
|
|
|
This repository contains the official implementation of [tinyCLAP](https://arxiv.org/abs/2311.14517). |
|
The project website is available at [this link](https://francescopaissan.it/tinyclapweb/).
|
|
|
![tinyCLAP overview](https://francescopaissan.it/tinyclapweb/assets/overview.png) |
|
|
|
## Requirements |
|
|
|
To clone the repo and install requirements: |
|
|
|
```bash

git clone https://github.com/fpaissan/tinyCLAP && cd tinyCLAP
|
pip install -r extra_requirements.txt |
|
``` |
|
|
|
## Training |
|
|
|
To train the model(s) in the paper, run this command: |
|
|
|
```bash |
|
MODEL_NAME=phinet_alpha_1.50_beta_0.75_t0_6_N_7 |
|
|
|
./run_tinyCLAP.sh $MODEL_NAME |
|
``` |
|
|
|
Note that `MODEL_NAME` encodes the student configuration: the script parses the PhiNet hyperparameters (alpha, beta, t0, N) directly from the name.

You can therefore change the student architecture simply by changing the model name (see the examples after the notes below).
|
|
|
Please note: |
|
- To use the original CLAP encoder in the distillation setting, replace the model name with `Cnn14`; |
|
- To reproduce the variants of PhiNet from the manuscript, refer to the hyperparameters listed in Table 1. |
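
For instance, the following commands launch distillation for a smaller student and for the original CLAP encoder. The alpha/beta/t0/N values in the first command are illustrative placeholders, not a configuration from Table 1:

```bash

# Hypothetical PhiNet student; take alpha/beta/t0/N from Table 1 to reproduce the paper.

./run_tinyCLAP.sh phinet_alpha_0.75_beta_0.50_t0_4_N_5

# Original CLAP audio encoder (Cnn14) as the student.

./run_tinyCLAP.sh Cnn14

```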
|
|
|
## Evaluation |
|
|
|
The command to evaluate the model varies slightly from dataset to dataset.

All the necessary commands are listed below; each `$PATH_TO_*` variable should point to a local copy of the corresponding dataset.
|
|
|
### ESC50 |
|
|
|
```bash |
|
python train_clap.py --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --esc_folder $PATH_TO_ESC |
|
``` |
|
|
|
### UrbanSound8K |
|
|
|
```bash |
|
python train_clap.py --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --us8k_folder $PATH_TO_US8K |
|
``` |
|
|
|
### TUT17 |
|
|
|
```bash |
|
python train_clap.py --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True --tut17_folder $PATH_TO_TUT17 |
|
``` |
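
To run all three zero-shot evaluations back to back, here is a minimal sketch, assuming the dataset folder flag is the only per-dataset change required:

```bash

# Loop over the three benchmarks; unquoted $FLAG expands into flag + path.

for FLAG in "--esc_folder $PATH_TO_ESC" "--us8k_folder $PATH_TO_US8K" "--tut17_folder $PATH_TO_TUT17"; do

  python train_clap.py --experiment_name tinyCLAP_$MODEL_NAME --zs_eval True $FLAG

done

```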
|
|
|
## Pre-trained Models |
|
|
|
You can download pretrained models from the [tinyCLAP Hugging Face repository](https://huggingface.co/fpaissan/tinyCLAP).
|
|
|
_Note_: The checkpoints on HF contain the entire CLAP module (including the text encoder and the teacher audio encoder).
|
|
|
To run inference using the pretrained models, please use: |
|
|
|
```bash |
|
python train_clap.py --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME --zs_eval True --tut17_folder $PATH_TO_TUT17 |
|
``` |
|
|
|
This command will automatically download the checkpoint, provided it is available in the zoo of pretrained models. Make sure to change the dataset flag and the corresponding dataset configuration file to match the benchmark you are evaluating.
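
For example, evaluating the same pretrained checkpoint on ESC-50 instead of TUT17 only requires swapping the dataset flag (a sketch, assuming `$PATH_TO_ESC` points to a local copy of ESC-50):

```bash

MODEL_NAME=phinet_alpha_1.50_beta_0.75_t0_6_N_7

# Fetches the checkpoint from the HF zoo (if available) and runs zero-shot evaluation on ESC-50.

python train_clap.py --pretrained_clap fpaissan/tinyCLAP/$MODEL_NAME --zs_eval True --esc_folder $PATH_TO_ESC

```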
|
|
|
A list of available models, together with their computational cost, is given in the following table:
|
|
|
| alpha | beta | t0 | N | Params [M] | ESC-50 | UrbanSound8K | TUT17 | |
|
|:-----:|:----:|:--:|:-:|:----------:|:------:|:------------:|:-----:| |
|
| 1.5 | 0.75 | 6 | 7 | 4.4 | | | | |
|
|
|
## Citing tinyCLAP |
|
|
|
```bibtex

@inproceedings{paissan2024tinyclap,

  title={tinyCLAP: Distilling Contrastive Language-Audio Pretrained Models},

  author={Paissan, Francesco and Farella, Elisabetta},

  booktitle={Interspeech 2024},

  year={2024}

}

```