Upload README.md with huggingface_hub
README.md CHANGED
@@ -31,16 +31,46 @@ model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)
path = "/my/path/to/audio.wav"
outputs = model(path)
candidate = outputs["cands"][0]
print(candidate)
```
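The hunk header above truncates the code that instantiates `model`. For context, here is a minimal loading sketch; the `from conette import ...` line and the `CoNeTTEConfig.from_pretrained` call are assumptions inferred from the class names in the hunk header, not lines taken from this diff:

```py
# Assumed setup: only the CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)
# call is visible in the hunk header above; the import and config lines are inferred.
from conette import CoNeTTEConfig, CoNeTTEModel

config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)
```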
The model can also accept several audio files at the same time (list[str]), or a list of pre-loaded audio files (list[Tensor]). In the second case, you also need to provide the sampling rate of these files:
```py
import torchaudio

path_1 = "/my/path/to/audio_1.wav"
path_2 = "/my/path/to/audio_2.wav"

# Pre-load the audio files as tensors.
audio_1, sr_1 = torchaudio.load(path_1)
audio_2, sr_2 = torchaudio.load(path_2)

# With pre-loaded tensors, the sampling rates must be passed as well.
outputs = model([audio_1, audio_2], sr=[sr_1, sr_2])
candidates = outputs["cands"]
print(candidates)
```
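Passing several file paths at once (the list[str] case mentioned above) does not require sampling rates. A minimal sketch, reusing the example paths from the snippet above:

```py
# File paths only; no sampling rate argument is needed in this case.
outputs = model([path_1, path_2])
candidates = outputs["cands"]
print(candidates)
```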
The model can also produce different captions using a Task Embedding input, which indicates the dataset caption style. The default task is "clotho".
```py
# Caption in the Clotho style (the default task).
outputs = model(path, task="clotho")
candidate = outputs["cands"][0]
print(candidate)

# Caption in the AudioCaps style.
outputs = model(path, task="audiocaps")
candidate = outputs["cands"][0]
print(candidate)
```
## Performance

| Dataset | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
| ------------- | ------------- | ------------- | ------------- |
| AudioCaps | 44.14 | 43.98 | 60.81 |
| Clotho | 30.97 | 30.87 | 51.72 |

This model checkpoint has been trained on the Clotho dataset, but it can also reach good performance on AudioCaps with the "audiocaps" task.
## Citation
The preprint version of the paper describing CoNeTTE is available on arxiv: https://arxiv.org/pdf/2309.00454.pdf
@@ -60,6 +90,6 @@ The preprint version of the paper describing CoNeTTE is available on arxiv: http
## Additional information
The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT.
More precisely, the encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.
It was created by [@Labbeti](https://hf.co/Labbeti).