Labbeti committed on
Commit
e07156c
·
1 Parent(s): f9d36dc

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +41 -26
README.md CHANGED
@@ -1,47 +1,52 @@
1
  ---
2
- language: en
3
- license: mit
4
- tags:
5
- - audio
6
- - captioning
7
- - text
8
- - audio-captioning
9
- - automated-audio-captioning
10
- task_categories:
11
- - audio-captioning
12
  ---
 
13
 
14
- # CoNeTTE (ConvNext-Transformer with Task Embedding) for Automated Audio Captioning
15
 
16
- <font color='red'>This model is currently in developement, and all the required files are not yet available.</font>
 
 
 
 
 
 
 
 
17
 
18
- This model generate a short textual description of any audio file.
 
 
 
 
19
 
20
  ## Installation
21
  ```bash
22
- pip install conette
 
23
  ```
24
 
25
- ## Usage
26
  ```py
27
  from conette import CoNeTTEConfig, CoNeTTEModel
28
 
29
  config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
30
  model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)
31
 
32
- path = "/my/path/to/audio.wav"
33
  outputs = model(path)
34
  candidate = outputs["cands"][0]
35
  print(candidate)
36
  ```
37
 
38
- The model can also accept several audio files at the same time (list[str]), or a list of pre-loaded audio files (list[Tensor]). IN this second case you also need to provide the sampling rate of this files:
39
 
40
  ```py
41
  import torchaudio
42
 
43
- path_1 = "/my/path/to/audio_1.wav"
44
- path_2 = "/my/path/to/audio_2.wav"
45
 
46
  audio_1, sr_1 = torchaudio.load(path_1)
47
  audio_2, sr_2 = torchaudio.load(path_2)
@@ -63,11 +68,19 @@ candidate = outputs["cands"][0]
63
  print(candidate)
64
  ```
65
 
 
 
 
 
 
 
 
66
  ## Performance
67
- | Dataset | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
68
- | ------------- | ------------- | ------------- | ------------- |
69
- | AudioCaps | 44.14 | 43.98 | 60.81 |
70
- | Clotho | 30.97 | 30.87 | 51.72 |
 
71
 
72
  This model checkpoint has been trained for the Clotho dataset, but it can also reach a good performance on AudioCaps with the "audiocaps" task.
73
 
@@ -89,7 +102,9 @@ The preprint version of the paper describing CoNeTTE is available on arxiv: http
89
 
90
  ## Additional information
91
 
92
- The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT.
93
- More precisely, the encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.
94
 
95
- It was created by [@Labbeti](https://hf.co/Labbeti).
 
 
 
1
  ---
2
+ {}
 
 
 
 
 
 
 
 
 
3
  ---
4
+ <div align="center">
5
 
6
+ # CoNeTTE model source
7
 
8
+ <a href="https://www.python.org/"><img alt="Python" src="https://img.shields.io/badge/-Python 3.10+-blue?style=for-the-badge&logo=python&logoColor=white"></a>
9
+ <a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/-PyTorch 1.10.1+-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white"></a>
10
+ <a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge&labelColor=gray"></a>
11
+ <a href="https://github.com/Labbeti/conette-audio-captioning/actions">
12
+ <img alt="Build" src="https://img.shields.io/github/actions/workflow/status/Labbeti/conette-audio-captioning/python-package-pip.yaml?branch=main&style=for-the-badge&logo=github">
13
+ </a>
14
+ <!-- <a href='https://aac-metrics.readthedocs.io/en/stable/?badge=stable'>
15
+ <img src='https://readthedocs.org/projects/aac-metrics/badge/?version=stable&style=for-the-badge' alt='Documentation Status' />
16
+ </a> -->
17
 
18
+ CoNeTTE is an audio captioning system that generates a short textual description of the sound events in any audio file.
19
+
20
+ </div>
21
+
22
+ CoNeTTE was developed by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD. CoNeTTE stands for ConvNeXt-Transformer model with Task Embedding; its architecture and training are explained in the corresponding [paper](https://arxiv.org/pdf/2309.00454.pdf).
23
 
24
  ## Installation
25
  ```bash
26
+ python -m pip install conette
27
+ python -m spacy download en_core_web_sm
28
  ```
29
 
30
+ ## Usage with Python
31
  ```py
32
  from conette import CoNeTTEConfig, CoNeTTEModel
33
 
34
  config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
35
  model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)
36
 
37
+ path = "/your/path/to/audio.wav"
38
  outputs = model(path)
39
  candidate = outputs["cands"][0]
40
  print(candidate)
41
  ```
42
 
43
+ The model can also accept several audio files at the same time (list[str]), or a list of pre-loaded audio files (list[Tensor]). In the second case, you also need to provide the sampling rate of these files:
44
 
45
  ```py
46
  import torchaudio
47
 
48
+ path_1 = "/your/path/to/audio_1.wav"
49
+ path_2 = "/your/path/to/audio_2.wav"
50
 
51
  audio_1, sr_1 = torchaudio.load(path_1)
52
  audio_2, sr_2 = torchaudio.load(path_2)
 
68
  print(candidate)
69
  ```
70
 
71
+ ## Usage with command line
72
+ Simply use the `conette-predict` command with the `--audio PATH1 PATH2 ...` option. You can also export results to a CSV file using `--csv_export PATH`.
73
+
74
+ ```bash
75
+ conette-predict --audio "/your/path/to/audio.wav"
76
+ ```
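
For scripted batch runs, the command line described above can be assembled from Python. The sketch below is a minimal illustration and not part of the conette package: `build_conette_command` and `run_conette` are hypothetical helper names, and it assumes only the `--audio` and `--csv_export` options mentioned above.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_conette_command(
    audio_paths: Sequence[str],
    csv_export: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `conette-predict` (hypothetical helper)."""
    if not audio_paths:
        raise ValueError("at least one audio path is required")
    # `--audio` takes one or more paths, as in `--audio PATH1 PATH2 ...`.
    command = ["conette-predict", "--audio", *audio_paths]
    if csv_export is not None:
        # Optionally write the predicted captions to a CSV file.
        command += ["--csv_export", csv_export]
    return command


def run_conette(command: Sequence[str]) -> None:
    """Run the command; this requires conette to be installed."""
    subprocess.run(list(command), check=True)


if __name__ == "__main__":
    command = build_conette_command(
        ["/your/path/to/audio_1.wav", "/your/path/to/audio_2.wav"],
        csv_export="captions.csv",
    )
    # Print the shell-quoted command without executing it.
    print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. Call `run_conette(command)` to execute the prediction once the package is installed.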
77
+
78
  ## Performance
79
+
80
+ | Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab | Outputs | Scores |
81
+ | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
82
+ | AC-test | 44.14 | 43.98 | 60.81 | 309 | [:clipboard:](results/conette/outputs_audiocaps_test.csv) | [:chart_with_upwards_trend:](results/conette/scores_audiocaps_test.yaml) |
83
+ | CL-eval | 30.97 | 30.87 | 51.72 | 636 | [:clipboard:](results/conette/outputs_clotho_eval.csv) | [:chart_with_upwards_trend:](results/conette/scores_clotho_eval.yaml) |
84
 
85
  This model checkpoint has been trained for the Clotho dataset, but it can also reach a good performance on AudioCaps with the "audiocaps" task.
86
 
 
102
 
103
  ## Additional information
104
 
105
+ - Model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette
106
+ - The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT. More precisely, the encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.
107
 
108
+ ## Contact
109
+ Maintainer:
110
+ - Etienne Labbé "Labbeti": [email protected]