|
--- |
|
license: apache-2.0 |
|
pipeline_tag: audio-text-to-text |
|
library_name: glap_model |
|
--- |
|
|
|
<div align="center"> |
|
<h1> |
|
GLAP (Generalized Language Audio Pretraining) |
|
</h1> |
|
<p> |
|
Official PyTorch code for <b>GLAP</b> <br> |
|
<b><em>Generalized Language Audio Pretraining</em></b> |
|
</p> |
|
|
<a href="https://arxiv.org/abs/2506.11350"><img src="https://img.shields.io/badge/arXiv-2506.11350-b31b1b" alt="version"></a> |
|
<a href="https://github.com/xiaomi/glap"><img src="https://img.shields.io/badge/Platform-linux-lightgrey" alt="version"></a> |
|
<a href="https://www.python.org"><img src="https://img.shields.io/badge/Python-3.10+-orange" alt="version"></a> |
|
<a href="https://pytorch.org"><img src="https://img.shields.io/badge/PyTorch-2.0+-brightgreen" alt="python"></a> |
|
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="mit"></a> |
|
<img src="https://img.shields.io/pypi/dm/glap_model" alt="PyPI Downloads"> |
|
|
|
</div> |
|
|
|
|
|
|
|
|
|
# GLAP (Generalized Language Audio Pretraining) |
|
|
|
|
|
<img src="capabilities.png" alt="GLAP capabiltiies" style="height: 600px;"> |
|
|
|
|
|
## Features |
|
|
|
|
|
* *First* all-in-one solution for general audio-text retrieval.

* Multilingual speech, music and sound retrieval across 8+ languages.

* Music and sound retrieval performance in English matches previous baselines, while also **supporting** languages such as Japanese, German, Spanish, Chinese, Dutch and more.
|
|
|
|
|
## Usage |
|
|
|
|
|
```bash |
|
pip install glap_model |
|
``` |
|
|
|
|
|
### Scoring audio-text pairs |
|
|
|
We provide a simple command-line tool (multiple texts are separated by `;`):
|
|
|
```bash |
|
score_glap audio_input_file "text1;text2;text3"
|
``` |
|
|
|
Or in Python: |
|
|
|
```python |
|
import torch |
|
from glap_model import glap_inference |
|
|
|
audio = torch.randn(1, 160000).tanh()  # 10 s of heavy noise at 16 kHz
|
|
|
glap_model = glap_inference() |
|
|
|
score = glap_model.score_forward(audio, text=["the sound of noise", "a car is driving", "a person is speaking"])
|
print(score) |
|
``` |
|
|
|
|
|
|
|
### Recommended Prompts |
|
|
|
| Task | Prompt | |
|
|--------|-----------------------------------------| |
|
| Speech | {label} | |
|
| Music | The music in the style of {label}. | |
|
| Sound | The sound of {label} can be heard. | |
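
For illustration, here is a minimal sketch of applying these templates with plain string formatting before encoding, reusing the `glap_inference` API shown elsewhere in this README (the labels here are made up):

```python
import torch

from glap_model import glap_inference

# Prompt templates from the table above, keyed by task.
TEMPLATES = {
    "speech": "{label}",
    "music": "The music in the style of {label}.",
    "sound": "The sound of {label} can be heard.",
}

glap_model = glap_inference()
audio = torch.randn(1, 160000).tanh()  # 10 s of placeholder noise at 16 kHz

# Illustrative sound-event labels; any label set works the same way.
labels = ["rain", "a barking dog", "glass breaking"]
prompts = [TEMPLATES["sound"].format(label=label) for label in labels]

text_embeds = glap_model.encode_text(prompts)
audio_embeds = glap_model.encode_audio(audio)
print(glap_model.score(audio_embeds, text_embeds))
```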
|
|
|
|
|
### Batched scoring |
|
|
|
|
|
```python |
|
import torch |
|
from glap_model import glap_inference |
|
|
|
glap_model = glap_inference() |
|
audio = torch.randn(1, 64000).tanh()  # 4 s of noise at 16 kHz

prefix = "The sound of"

labels = [f"{prefix} {label}" for label in ("Cat", "Dog", "Water", "Noise")]
|
text_embeds = glap_model.encode_text(labels) |
|
audio_embeds = glap_model.encode_audio(audio) |
|
scores = glap_model.score(audio_embeds, text_embeds) |
|
for label_name, score in zip(labels, scores): |
|
    print(label_name, score)
```
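
Encoding audio and text separately lets you compute the text embeddings once and reuse them across many audio clips, instead of re-encoding every audio-text pair with `score_forward`.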
|
|
|
## Development |
|
|
|
|
|
### UV (Recommended) |
|
|
|
```bash |
|
git clone https://github.com/xiaomi-research/GLAP |
|
cd GLAP |
|
uv venv --python 3.10 |
|
source .venv/bin/activate
|
uv sync |
|
|
|
# Alternatively: python3 -m pip install .
|
# Additionally, sndfile is needed |
|
# conda install -c conda-forge libsndfile==1.0.31 |
|
``` |
|
|
|
### Pip |
|
|
|
```bash |
|
git clone https://github.com/xiaomi-research/GLAP |
|
cd GLAP |
|
python3 -m pip install . |
|
# Additionally, sndfile is needed |
|
# conda install -c conda-forge libsndfile==1.0.31 |
|
# Or if you have root, use your package manager |
|
``` |
|
|
|
|
|
### Prepare data |
|
|
|
|
|
Data needs to be in `tar/tar.gz` format: |
|
|
|
``` |
|
# tar -tf a.tar |
|
908-31957-0013.flac |
|
908-31957-0013.json |
|
2961-960-0013.flac |
|
2961-960-0013.json |
|
``` |
|
|
|
|
|
Each `.json` should contain one of the three fields `caption`, `captions` or `text`.

Data preparation can be done with the `wavlist_to_tar` script, which is provided by the `dasheng` dependency.

Further information on how to process data can be found [here](https://github.com/XiaoMi/dasheng?tab=readme-ov-file#3-training).
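
For illustration, a shard with this layout can also be assembled with Python's standard `tarfile` module. This is a minimal sketch with made-up file names; for real datasets, the `wavlist_to_tar` script above is the intended route:

```python
import json
import tarfile
from pathlib import Path

# Minimal sketch of the expected shard layout: each audio file is paired
# with a .json of the same stem carrying a `caption`, `captions` or
# `text` field. File names here are only illustrative.
stem = "908-31957-0013"
Path(f"{stem}.flac").touch()  # stand-in for a real FLAC recording
Path(f"{stem}.json").write_text(json.dumps({"caption": "A person reads a sentence aloud."}))

with tarfile.open("a.tar", "w") as tar:
    tar.add(f"{stem}.flac")
    tar.add(f"{stem}.json")
# `tar -tf a.tar` should now match the listing above.
```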
|
|
|
### Training |
|
|
|
|
|
For reference, we provide our original GLAP training config at `configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml`.
|
|
|
|
|
```bash |
|
accelerate launch --mixed-precision='fp16' run.py train configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml |
|
``` |
|
|
|
|
|
### Zeroshot eval (one sample) |
|
|
|
|
|
```bash |
|
# The ; character separates the different text candidates
|
python3 run.py zeroshot pretrained_checkpoint/glap_checkpoint.pt PATH_TO_WAV_FLAC_MP3_SAMPLE.wav "The sound of a horse;Car;Mama;The sound of music;somebody is speaking;The sound of ein Pferd;一只马;Music is played;音乐的声音;Musik ist zu hoeren;Zero;One;Two;Three"
|
``` |
|
|
|
### Retrieval scoring |
|
|
|
```bash |
|
# Should be run on a single GPU |
|
accelerate launch --mixed-precision='fp16' run.py evaluate PATH_TO_CHECKPOINT |
|
``` |
|
|
|
|
|
|
|
### Notes on DDP |
|
|
|
Using training datasets of uneven size without `resample=True` is not recommended: under DDP, ranks can otherwise run out of batches at different times and stall training.
|
|
|
|
|
## Translating data into a target language |
|
|
|
For our experiments, we used SONAR to translate audio captions into seven target languages. This can be reproduced with our code:
|
|
|
|
|
```bash |
|
python3 run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/ |
|
``` |
|
|
|
DDP is also supported: |
|
|
|
```bash |
|
accelerate launch run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/ |
|
``` |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@misc{2506.11350,
  author = {Heinrich Dinkel and Zhiyong Yan and Tianzi Wang and Yongqing Wang and Xingwei Sun and Yadong Niu and Jizhong Liu and Gang Li and Junbo Zhang and Jian Luan},
  title = {GLAP: General contrastive audio-text pretraining across domains and languages},
  year = {2025},
  eprint = {2506.11350},
  archivePrefix = {arXiv},
}
|
``` |