|
--- |
|
language: en |
|
license: mit |
|
tags: |
|
- audio |
|
- captioning |
|
- text |
|
- audio-captioning |
|
- automated-audio-captioning |
|
model_name: CoNeTTE |
|
task_categories: |
|
- audio-captioning |
|
--- |
|
|
|
<div align="center"> |
|
|
|
# CoNeTTE model source |
|
|
|
<a href="https://www.python.org/"><img alt="Python" src="https://img.shields.io/badge/-Python 3.10+-blue?style=for-the-badge&logo=python&logoColor=white"></a><a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/-PyTorch 1.10.1+-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white"></a><a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge&labelColor=gray"></a> |
|
<a href="https://github.com/Labbeti/conette-audio-captioning/actions"> |
|
<img alt="Build" src="https://img.shields.io/github/actions/workflow/status/Labbeti/conette-audio-captioning/python-package-pip.yaml?branch=main&style=for-the-badge&logo=github"> |
|
</a> |
|
|
|
|
CoNeTTE is an audio captioning system that generates a short textual description of the sound events in an audio file.
|
|
|
</div> |
|
|
|
CoNeTTE was developed by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD. CoNeTTE stands for ConvNeXt-Transformer model with Task Embedding; its architecture and training are explained in the corresponding [paper](https://arxiv.org/pdf/2309.00454.pdf).
|
|
|
## Installation |
|
```bash
python -m pip install conette
python -m spacy download en_core_web_sm
```
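
As a quick sanity check, both installation steps can be verified from the shell (a minimal sketch, not an official verification step):

```bash
python -c "import conette; import spacy; spacy.load('en_core_web_sm')"
```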
|
|
|
## Usage with Python
|
```py
from conette import CoNeTTEConfig, CoNeTTEModel

# Load the pretrained configuration and weights from the Hugging Face Hub.
config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

# Generate a caption for a single audio file.
path = "/your/path/to/audio.wav"
outputs = model(path)
candidate = outputs["cands"][0]
print(candidate)
```
|
|
|
The model can also accept several audio files at the same time (`list[str]`), or a list of pre-loaded audio tensors (`list[Tensor]`). In the second case, you also need to provide the sampling rate of these files:
|
|
|
```py
import torchaudio

path_1 = "/your/path/to/audio_1.wav"
path_2 = "/your/path/to/audio_2.wav"

# Pre-load the files as tensors, keeping their sampling rates.
audio_1, sr_1 = torchaudio.load(path_1)
audio_2, sr_2 = torchaudio.load(path_2)

# Caption both files in a single batched call.
outputs = model([audio_1, audio_2], sr=[sr_1, sr_2])
candidates = outputs["cands"]
print(candidates)
```
|
|
|
The model can also produce captions in different styles using a Task Embedding input, which indicates the dataset caption style. The default task is "clotho".
|
|
|
```py
# Caption in the Clotho style (the default task).
outputs = model(path, task="clotho")
candidate = outputs["cands"][0]
print(candidate)

# Caption the same file in the AudioCaps style.
outputs = model(path, task="audiocaps")
candidate = outputs["cands"][0]
print(candidate)
```
|
|
|
## Usage from the command line

Simply use the `conette-predict` command with the `--audio PATH1 PATH2 ...` option. You can also export the results to a CSV file using `--csv_export PATH`, as shown in the second example below.
|
|
|
```bash
conette-predict --audio "/your/path/to/audio.wav"
```
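
For example, to caption two files at once and export the results to a CSV file:

```bash
conette-predict --audio "/your/path/to/audio_1.wav" "/your/path/to/audio_2.wav" --csv_export "/your/path/to/results.csv"
```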
|
|
|
## Performance |
|
|
|
| Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab size | Outputs | Scores |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| AudioCaps test | 44.14 | 43.98 | 60.81 | 309 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/outputs_audiocaps_test.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/scores_audiocaps_test.yaml) |
| Clotho eval | 30.97 | 30.87 | 51.72 | 636 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/outputs_clotho_eval.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/conette/scores_clotho_eval.yaml) |
|
|
|
This model checkpoint was trained on the Clotho dataset, but it can also achieve good performance on AudioCaps with the "audiocaps" task.
|
|
|
## Limitations |
|
The model was trained on audio sampled at 32 kHz and lasting from 1 to 30 seconds. It can process longer audio files, but it may give worse results.
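
If a recording is much longer than 30 seconds, one possible workaround is to resample it to 32 kHz and caption fixed-length chunks separately. The sketch below only combines the `torchaudio` calls and the `model(..., sr=...)` interface shown above; the chunking strategy itself is an assumption, not an official feature:

```py
import torchaudio
import torchaudio.functional as F

path = "/your/path/to/long_audio.wav"
audio, sr = torchaudio.load(path)

# Resample to 32 kHz, the rate seen during training (assumption: matching
# the training rate keeps the input in-distribution).
target_sr = 32_000
if sr != target_sr:
    audio = F.resample(audio, orig_freq=sr, new_freq=target_sr)

# Hypothetical workaround: split the recording into 30-second chunks and
# caption each chunk with the model loaded in the Python usage section.
chunk_len = 30 * target_sr
chunks = [audio[:, i : i + chunk_len] for i in range(0, audio.shape[1], chunk_len)]

outputs = model(chunks, sr=[target_sr] * len(chunks))
print(outputs["cands"])
```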
|
|
|
## Citation |
|
The preprint version of the paper describing CoNeTTE is available on arXiv: https://arxiv.org/pdf/2309.00454.pdf
|
|
|
```bibtex
@misc{labbé2023conette,
    title = {CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding},
    author = {Étienne Labbé and Thomas Pellegrini and Julien Pinquier},
    year = 2023,
    journal = {arXiv preprint arXiv:2309.00454},
    url = {https://arxiv.org/pdf/2309.00454.pdf},
    eprint = {2309.00454},
    archiveprefix = {arXiv},
    primaryclass = {cs.SD}
}
```
|
|
|
## Additional information |
|
|
|
- Model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette |
|
- The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT. More precisely, the encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843. |
|
|
|
## Contact |
|
Maintainer: |
|
- Étienne Labbé "Labbeti": [email protected]