Audio-Text-to-Text
glap_model
richermans committed (verified)
Commit 9527571 · 1 Parent(s): 8a4a67d

Upload README.md with huggingface_hub

Files changed (1): README.md (+204, −0)

README.md:
---
license: apache-2.0
---

<div align="center">
<h1>
GLAP (Generalized Language Audio Pretraining)
</h1>
<p>
Official PyTorch code for <b>GLAP</b> <br>
<b><em>Generalized Language Audio Pretraining</em></b>
</p>
<a href="https://arxiv.org/abs/2406.06992"><img src="https://img.shields.io/badge/arXiv-2406.06992-b31b1b" alt="arxiv"></a>
<a href="https://github.com/xiaomi/glap"><img src="https://img.shields.io/badge/Platform-linux-lightgrey" alt="platform"></a>
<a href="https://www.python.org"><img src="https://img.shields.io/badge/Python-3.10+-orange" alt="python"></a>
<a href="https://pytorch.org"><img src="https://img.shields.io/badge/PyTorch-2.0+-brightgreen" alt="pytorch"></a>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="license"></a>
<img src="https://img.shields.io/pypi/dm/glap_model" alt="PyPI Downloads">

</div>

# GLAP (Generalized Language Audio Pretraining)

<img src="resources/capabilities.png" alt="GLAP capabilities" style="height: 600px;">

## Features

* *First* all-in-one solution for general audio-text retrieval.
* Multilingual (8+ languages) speech, music, and sound retrieval.
* Music and sound retrieval performance in English matches previous baselines, while also supporting languages such as Japanese, German, Spanish, Chinese, Dutch, and more.

## Usage

```bash
pip install glap_model
```

### Scoring audio-text pairs

We provide a simple command-line tool:

```bash
score_glap audio_input_file "text1;text2;text3"
```

Or in Python:

```python
import torch
from glap_model import glap_inference

audio = torch.randn(1, 160000).tanh()  # 10 s of heavy noise

glap_model = glap_inference()

score = glap_model.score_forward(audio, text=["the sound of noise", "a car is driving", "a person is speaking"])
print(score)
```
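
For real audio, load a waveform from disk instead of generating noise. Below is a minimal sketch, assuming 16 kHz mono input (consistent with the 10 s / 160000-sample example above); `example.wav` is a placeholder, and `soundfile` is just one reader that works on top of the `libsndfile` requirement mentioned below:

```python
import soundfile as sf
import torch
from glap_model import glap_inference

# Placeholder file; soundfile reads wav/flac/ogg via libsndfile.
wav, sr = sf.read("example.wav", dtype="float32")
assert sr == 16000, "assumed model sample rate; resample beforehand if needed"
if wav.ndim > 1:
    wav = wav.mean(axis=1)  # downmix multi-channel audio to mono

audio = torch.from_numpy(wav).unsqueeze(0)  # shape (1, num_samples)
glap_model = glap_inference()
print(glap_model.score_forward(audio, text=["a dog barking", "rain falling"]))
```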

### Recommended Prompts

| Task   | Prompt                             |
|--------|------------------------------------|
| Speech | {label}                            |
| Music  | The music in the style of {label}. |
| Sound  | The sound of {label} can be heard. |
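
In code, these prompts are plain string templates filled in before `encode_text`; a small sketch with made-up label sets:

```python
from glap_model import glap_inference

glap_model = glap_inference()

# Hypothetical labels; pick the template matching each label's domain.
music_labels = ["jazz", "lo-fi hip hop"]
sound_labels = ["rain", "a car engine"]
prompts = [f"The music in the style of {l}." for l in music_labels] + [
    f"The sound of {l} can be heard." for l in sound_labels
]
text_embeds = glap_model.encode_text(prompts)
```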

### Batched scoring

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()
audio = torch.randn(1, 64000).tanh()  # 4 s of noise
prefix = "The sound of"
labels = [f"{prefix} {label}" for label in ("Cat", "Dog", "Water", "Noise")]
text_embeds = glap_model.encode_text(labels)
audio_embeds = glap_model.encode_audio(audio)
scores = glap_model.score(audio_embeds, text_embeds)
for label_name, score in zip(labels, scores):
    print(label_name, score)
```
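
The per-label scores can be reduced to a zero-shot prediction by taking the best-matching prompt; a minimal sketch, assuming `score` yields one similarity per text as in the loop above:

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()
labels = [f"The sound of {l}" for l in ("Cat", "Dog", "Water", "Noise")]
scores = glap_model.score(
    glap_model.encode_audio(torch.randn(1, 64000).tanh()),
    glap_model.encode_text(labels),
)
# Pick the prompt with the highest similarity as the zero-shot prediction.
best = int(torch.as_tensor(scores).flatten().argmax())
print("Predicted:", labels[best])
```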

## Development

### UV (Recommended)

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
uv venv --python 3.10
source .venv/bin/activate
uv sync

# Alternatively: python3 -m pip install .
# Additionally, sndfile is needed:
# conda install -c conda-forge libsndfile==1.0.31
```

### Pip

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
python3 -m pip install .
# Additionally, sndfile is needed:
# conda install -c conda-forge libsndfile==1.0.31
# Or, if you have root, use your system package manager.
```

### Prepare data

Data needs to be in `tar`/`tar.gz` format:

```
# tar -tf a.tar
908-31957-0013.flac
908-31957-0013.json
2961-960-0013.flac
2961-960-0013.json
```

Each `.json` should contain one of the three fields `caption`, `captions`, or `text`.
Data preparation can be done using the `wavlist_to_tar` script, which is provided by the `dasheng` dependency.
Further information on how to process data can be found [here](https://github.com/XiaoMi/dasheng?tab=readme-ov-file#3-training).
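
As an alternative to `wavlist_to_tar`, a shard in the layout shown above can be assembled with Python's standard `tarfile` module; a minimal sketch (file names and captions are placeholders, and only the `caption` field is written):

```python
import json
import tarfile
from pathlib import Path

# Placeholder audio files and captions; the audio files must already exist.
samples = {
    "908-31957-0013.flac": "a person reads a book aloud",
    "2961-960-0013.flac": "a second spoken utterance",
}

with tarfile.open("a.tar", "w") as tar:
    for audio_file, caption in samples.items():
        tar.add(audio_file)  # the audio itself
        sidecar = Path(audio_file).with_suffix(".json")
        sidecar.write_text(json.dumps({"caption": caption}))
        tar.add(str(sidecar))  # JSON sidecar with the same basename
```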

### Training

For reference, we provide our original training config for GLAP in `configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml`.

```bash
accelerate launch --mixed-precision='fp16' run.py train configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml
```

### Zeroshot eval (one sample)

```bash
# The ; serves as a separator between different text prompts
python3 run.py zeroshot pretrained_checkpoint/glap_checkpoint.pt PATH_TO_WAV_FLAC_MP3_SAMPLE.wav "The sound of a horse;Car;Mama;The sound of music;somebody is speaking;The sound of ein Pferd;一只马;Music is played;音乐的声音;Musik ist zu hoeren;Zero;One;Two;Three"
```

### Retrieval scoring

```bash
# Should be run on a single GPU
accelerate launch --mixed-precision='fp16' run.py evaluate PATH_TO_CHECKPOINT
```

### Notes on DDP

Using uneven training datasets without `resample=True` is not recommended.

## Translating data into a target language

For our experiments, we used SONAR to translate audio captions into seven target languages. This can be reproduced using our code:

```bash
python3 run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/
```

DDP is also supported:

```bash
accelerate launch run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/
```

## Citation

TODO
```bibtex
@inproceedings{dinkel2025glap,
  title={GLAP: General contrastive audio-text pretraining across domains and languages},
  year={2025}
}
```