---
license: mit
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
- yue
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: feature-extraction
tags:
- music
---
# **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**
[Homepage](https://sanderwood.github.io/clamp3/)
[Paper](https://arxiv.org/abs/2502.10362)
[GitHub](https://github.com/sanderwood/clamp3)
[Demo](https://huggingface.co/spaces/sander-wood/clamp3)
[Model weights](https://huggingface.co/sander-wood/clamp3/tree/main)
[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag)
[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x)
<p align="center">
<img src="overview.png" alt="CLaMP 3 Overview" width="50%">
</p>
## **Overview**
CLaMP 3 is a **state-of-the-art** framework for **music information retrieval (MIR)** across multiple **modalities** (✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**) and **languages** (27 trained, 100 supported). It leverages **contrastive learning** to align diverse music modalities into a **shared representation space**, enabling seamless cross-modal retrieval. You can think of it as a more comprehensive version of CLAP or MuLan, with much stronger performance, support for all major music modalities, and global language coverage.
**Why CLaMP 3?**
- ✅ **Multimodal**: Works with ✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**
- ✅ **Multilingual**: Trained on **27 languages** and generalizes to **100 languages**
- ✅ **SOTA Performance**: Significantly **outperforms previous strong baselines** across modalities and languages
## ✨ **Key Features**
### **Multimodal Support**
- **Sheet Music**: Interleaved ABC notation (**512 bars**)
- **Performance Signals**: MIDI Text Format (**512 MIDI messages**)
- **Audio Recordings**: [MERT](https://arxiv.org/abs/2306.00107) features (**640 sec of audio**)
### **Multilingual Capabilities**
- Trained on **27 languages**, generalizes to **100 languages** using [XLM-R](https://arxiv.org/abs/1911.02116)
### **Visual Semantic Understanding**
- Learns visual semantics (e.g., image captions) for tasks like **image-to-music retrieval**
### **Datasets & Benchmarks**
- **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag)**: **2.31M music-text pairs**
- **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x)**: **1,000 music triplets**
## 🔥 **What Can CLaMP 3 Do?**
- **Text-to-Music Retrieval**: Search music with text (100 languages!)
- 📸 **Image-to-Music Retrieval**: Match music to images 🎨
- **Cross-Modal Retrieval**: Find related music across different modalities
- 🛠️ **Zero-Shot Classification**: Identify genre, mood, style, & more 🏷️
- 🎼 **Semantic Similarity**: Measure semantic similarity between generated & reference music

**Check it out**: [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/)
## **Quick Start Guide**
To get started quickly with CLaMP 3, follow these steps:
### **Install the Environment**
Run the following commands:
```bash
conda create -n clamp3 python=3.10.16 -y
conda activate clamp3
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements.txt
```
### **Overview of `clamp3_*.py` Scripts**
CLaMP 3 provides scripts for **semantic search**, **semantic similarity calculation**, **retrieval performance evaluation**, and **feature extraction** across five modalities. Simply provide the file path, and the script will automatically detect the modality and extract the relevant features.
Supported formats include:
- **Audio**: `.mp3`, `.wav`
- **Performance Signals**: `.mid`, `.midi`
- **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
- **Images**: `.png`, `.jpg`
- **Text**: `.txt` (in 100 languages)
#### **Feature Management**
- Extracted features are stored in the `cache/` directory and reused in future runs to avoid recomputation.
- Temporary files are saved in `temp/` and cleaned up after each run.
> **Note**: All files in a folder must belong to the same modality for processing.
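The extension-based detection and the single-modality rule described above can be sketched as follows. This is a minimal illustration, not the scripts' own code: the table is copied from the supported-formats list, while the modality labels (`"audio"`, `"performance"`, `"sheet"`, `"image"`, `"text"`) and both function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Extension-to-modality table copied from the supported-formats list above.
# The modality labels themselves are hypothetical names chosen for this sketch.
EXTENSION_TO_MODALITY = {
    ".mp3": "audio", ".wav": "audio",
    ".mid": "performance", ".midi": "performance",
    ".mxl": "sheet", ".musicxml": "sheet", ".xml": "sheet",
    ".png": "image", ".jpg": "image",
    ".txt": "text",
}


def detect_modality(path):
    """Return the modality label for a file, or None if the extension is unsupported."""
    return EXTENSION_TO_MODALITY.get(Path(path).suffix.lower())


def folder_modality(filenames):
    """Return the single modality shared by all files, raising if they are mixed."""
    counts = Counter(detect_modality(name) for name in filenames)
    counts.pop(None, None)  # this sketch simply ignores unsupported files
    if len(counts) != 1:
        raise ValueError(f"expected exactly one modality, found: {sorted(counts)}")
    return next(iter(counts))
```

Running a check like `folder_modality` on a file listing before calling the scripts is one way to catch a mixed folder early.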
#### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Semantic Search**
Run retrieval tasks by comparing a query file to reference files in `ref_dir`. The query and `ref_dir` can be **any modality**, so there are **25 possible retrieval combinations**, e.g., text-to-music, image-to-music, music-to-music, music-to-text (zero-shot music classification), etc.
```bash
python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
```
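Conceptually, this kind of retrieval ranks reference embeddings by their similarity to the query embedding. The sketch below shows that idea with NumPy and cosine similarity. It assumes one global vector per file, and the choice of cosine similarity and the function name `top_k_matches` are assumptions for illustration rather than a description of what `clamp3_search.py` does internally.

```python
import numpy as np


def top_k_matches(query_vec, ref_vecs, ref_names, top_k=10):
    """Rank reference items by cosine similarity to a single query vector."""
    query = np.asarray(query_vec, dtype=np.float64).reshape(-1)
    refs = np.asarray(ref_vecs, dtype=np.float64)
    # Normalize so that the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    refs = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    scores = refs @ query
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(ref_names[i], float(scores[i])) for i in order]


# Toy 3-dimensional vectors standing in for real embeddings.
toy_refs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
for name, score in top_k_matches([1.0, 0.0, 0.0], toy_refs, ["a", "b", "c"], top_k=2):
    print(name, round(score, 3))
```

The same ranking covers zero-shot classification: use a music file as the query and a folder of label texts as the references, then read the top-ranked label.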
#### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**
This script calculates semantic similarity between query and reference files. By default, it uses **pairwise mode**, but you can switch to **group mode** using the `--group` flag.
```bash
python clamp3_score.py <query_dir> <ref_dir> [--group]
```
- **Pairwise Mode (default)**:
Compares files with **matching prefixes** and **identical folder structures**.
**Folder structure example**:
```
query_dir/
├── en/
│   ├── sample1.wav
├── zh/
│   ├── sample1.1.wav
│   ├── sample1.2.wav
│   ├── sample2.wav
ref_dir/
├── en/
│   ├── sample1.txt
├── zh/
│   ├── sample1.txt
│   ├── sample2.txt
```
- Files with the **same prefix** (before the first dot) are treated as pairs (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`).
- Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `ref_dir/zh/sample1.txt`).
**Important**:
- **Pairwise mode** can be **slow** for large datasets.
- If you have a large dataset, **switch to group mode** for faster computation.
- **Group Mode**:
Compares **all query files** to **all reference files** and calculates the average similarity.
**Enable Group Mode**:
```bash
python clamp3_score.py query_dir ref_dir --group
```
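The two modes can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `clamp3_score.py`: the helper names are hypothetical, paths are taken relative to `query_dir` and `ref_dir`, and group mode is read here as the mean cosine similarity over every query/reference combination.

```python
from pathlib import PurePosixPath

import numpy as np


def pair_key(relative_path):
    """Key = parent folder plus the file name up to its first dot."""
    path = PurePosixPath(relative_path)
    return str(path.parent / path.name.split(".", 1)[0])


def match_pairs(query_files, ref_files):
    """Pair every query file with the reference file that shares its key."""
    refs = {pair_key(ref): ref for ref in ref_files}
    return [(q, refs[pair_key(q)]) for q in query_files if pair_key(q) in refs]


def group_score(query_vecs, ref_vecs):
    """Mean cosine similarity over all query/reference combinations (assumed reading)."""
    q = np.asarray(query_vecs, dtype=np.float64)
    r = np.asarray(ref_vecs, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    return float((q @ r.T).mean())


# The folder example above: both zh/sample1.*.wav files map to zh/sample1.txt.
queries = ["en/sample1.wav", "zh/sample1.1.wav", "zh/sample1.2.wav", "zh/sample2.wav"]
references = ["en/sample1.txt", "zh/sample1.txt", "zh/sample2.txt"]
for query, reference in match_pairs(queries, references):
    print(query, "<->", reference)
```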
#### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Retrieval Performance Evaluation**
Evaluates **CLaMP 3's retrieval performance** on a paired dataset using metrics like **MRR** and **Hit@K**. It works the same way as **pairwise mode** in `clamp3_score.py`, requiring **matching folder structure** and **filenames** between `query_dir` and `ref_dir`.
```bash
python clamp3_eval.py <query_dir> <ref_dir>
```
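For readers unfamiliar with the two metrics, here is a minimal sketch of how MRR and Hit@K are commonly computed from a similarity matrix. It simplifies the paired setup to one reference per query (query `i` matches reference `i`). The function name and the default `ks` values are hypothetical, and the script's own handling of ties or many-to-one pairs may differ.

```python
import numpy as np


def retrieval_metrics(similarity, ks=(1, 10, 100)):
    """Compute MRR and Hit@K from a square similarity matrix.

    similarity[i, j] is the score between query i and reference j;
    the correct reference for query i is assumed to be reference i.
    """
    sim = np.asarray(similarity, dtype=np.float64)
    correct = np.diag(sim)[:, None]
    # Rank of the correct item = 1 + number of references scored strictly higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    metrics = {"MRR": float((1.0 / ranks).mean())}
    for k in ks:
        metrics[f"Hit@{k}"] = float((ranks <= k).mean())
    return metrics


# Toy example: queries 0 and 2 rank their match first, query 1 ranks it second.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
print(retrieval_metrics(toy_similarity, ks=(1, 2)))
```

MRR averages the reciprocal rank of the correct item, so it rewards near misses, while Hit@K only counts whether the correct item appears in the top K.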
#### **[`clamp3_embd.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_embd.py) - Feature Extraction**
If other scripts don't meet your needs, use `clamp3_embd.py` to extract features.
```bash
python clamp3_embd.py <input_dir_path> <output_dir_path> [--get_global]
```
**Feature Output:**
- **Without `--get_global`** → Shape: **(1, T, 768)** (T = time steps). Uses the last hidden states before average pooling; ideal for applications that need temporal information. Fine-tuning is recommended.
- **With `--get_global`** → Shape: **(1, 768)**. Uses average-pooled features; suitable for applications that need global information, and usable directly.
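The relationship between the two output shapes can be illustrated with NumPy. The sketch below uses random numbers as a stand-in for a saved feature file and assumes a plain unweighted mean over the time axis. Whether that reproduces the `--get_global` output exactly is not stated here, so treat it as a shape illustration only; `to_global` is a hypothetical helper name.

```python
import numpy as np


def to_global(features):
    """Average a (1, T, D) array over the time axis to get a (1, D) array."""
    features = np.asarray(features)
    if features.ndim != 3:
        raise ValueError(f"expected shape (1, T, D), got {features.shape}")
    return features.mean(axis=1)


# Random stand-in for temporal features with T = 12 time steps.
temporal = np.random.default_rng(0).normal(size=(1, 12, 768))
global_vec = to_global(temporal)
print(temporal.shape, global_vec.shape)
```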
## **Repository Structure**
- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)**: Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)**: Linear classification training and prediction.
- **[inference/](https://github.com/sanderwood/clamp3/tree/main/inference)**: Semantic search, similarity calculations, and retrieval evaluation.
- **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)**: Convert data into Interleaved ABC, MTF, or MERT-extracted features.
> **Note:** Ensure the model weights are placed in the `code/` folder, and verify the configuration hyperparameters before use.
## **Key Script Overview**
### **Data Preparation**
#### **1. Convert Music Data to Compatible Formats**
Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.
##### **1.1 Convert MusicXML to Interleaved ABC Notation**
CLaMP 3 requires **Interleaved ABC notation** for sheet music. Follow these steps:
1. Convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):
```bash
python batch_xml2abc.py <input_dir> <output_dir>
```
- **Input:** Directory containing `.mxl`, `.xml`, `.musicxml` files
- **Output:** Directory where converted `.abc` (Standard ABC) files will be saved
2. Convert **Standard ABC** into **Interleaved ABC** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):
```bash
python batch_interleaved_abc.py <input_dir> <output_dir>
```
- **Input:** Directory containing `.abc` (Standard ABC) files
- **Output:** Directory where Interleaved ABC files will be saved *(for CLaMP 3 use)*
##### **1.2 Convert MIDI to MTF Format**
CLaMP 3 processes performance signals in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):
```bash
python batch_midi2mtf.py <input_dir> <output_dir> --m3_compatible
```
- **Input:** Directory containing `.mid`, `.midi` files
- **Output:** Directory where `.mtf` files will be saved *(MTF format for CLaMP 3)*
- **Important:** The `--m3_compatible` flag **must be included** to ensure the output format is compatible with CLaMP 3. Without this flag, the extracted MTF files **will not work** correctly in the pipeline.
##### **1.3 Extract Audio Features using MERT**
For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract MERT-based features from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):
```bash
python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path m-a-p/MERT-v1-95M --mean_features
```
- **Input:** `.mp3`, `.wav`
- **Output:** `.npy` *(Processed audio features for CLaMP 3)*
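The three preprocessing routes above can be collected in one small helper that returns the command lines for a given modality. This is a convenience sketch only: the function name, the modality labels, and the intermediate `abc_dir` folder are hypothetical, while the script names and flags are the ones shown in the commands above.

```python
def preprocessing_commands(modality, input_dir, output_dir, abc_dir="standard_abc"):
    """Return the preprocessing commands (as argument lists) for one modality.

    abc_dir is a hypothetical intermediate folder for Standard ABC files.
    """
    if modality == "sheet":
        # Two steps: MusicXML -> Standard ABC -> Interleaved ABC.
        return [
            ["python", "batch_xml2abc.py", input_dir, abc_dir],
            ["python", "batch_interleaved_abc.py", abc_dir, output_dir],
        ]
    if modality == "midi":
        # --m3_compatible is required for CLaMP 3, as noted above.
        return [["python", "batch_midi2mtf.py", input_dir, output_dir, "--m3_compatible"]]
    if modality == "audio":
        return [[
            "python", "extract_mert.py",
            "--input_path", input_dir,
            "--output_path", output_dir,
            "--model_path", "m-a-p/MERT-v1-95M",
            "--mean_features",
        ]]
    raise ValueError(f"unknown modality: {modality!r}")


for command in preprocessing_commands("sheet", "musicxml/", "interleaved_abc/"):
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the folder that contains the corresponding script.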
### **Training and Feature Extraction**
#### **1. Training Models**
The pre-trained CLaMP 3 weights cover most retrieval use cases, so retraining is usually not needed. If you do need to train, follow these steps.
1. Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters** and **data paths**.
2. Train on your own data.
To train CLaMP 3 on **symbolic music** (e.g., sheet music, MIDI), run:
```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
```
For **audio data**, use:
```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
```
##### **Using Pre-Trained Models (Recommended)**
For most use cases, it's best to use pre-trained weights instead of training from scratch.
| Version | Best for | Download Link |
|---------|---------|--------------|
| **CLaMP 3 SAAS** | **Audio-based retrieval (Recommended)** | [Download SAAS](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |
| **CLaMP 3 C2** | **Symbolic music retrieval (Sheet music, MIDI)** | [Download C2](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |
##### **How to Switch Between Versions?**
By default, CLaMP 3 is configured for the **SAAS version** (optimized for audio).
- If working with **symbolic music (MIDI, sheet music)**, use the **C2 version**:
**Modify line 66 in `config.py`** from `"saas"` to `"c2"`.
#### **2. Feature Extraction**
After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):
```bash
accelerate launch extract_clamp3.py --epoch <epoch> <input_dir> <output_dir> --get_global
```
- **`--epoch <epoch>`:** (Optional) Specify the checkpoint epoch.
- **`<input_dir>`:** Directory containing the input files.
- **`<output_dir>`:** Destination folder for the output `.npy` features.
- **`--get_global`**: **(Required for retrieval!)** Extracts a **global semantic vector** for each input.
All extracted features are stored as `.npy` files.
> **Note**: For retrieval, `--get_global` must be used. Without it, CLaMP 3 will not work correctly for retrieval tasks. Omit `--get_global` only if you are performing downstream fine-tuning or need raw feature extraction for custom tasks.
## **Citation**
If you find CLaMP 3 useful in your work, please consider citing our paper:
```bibtex
@misc{wu2025clamp3universalmusic,
title={CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages},
author={Shangda Wu and Zhancheng Guo and Ruibin Yuan and Junyan Jiang and Seungheon Doh and Gus Xia and Juhan Nam and Xiaobing Li and Feng Yu and Maosong Sun},
year={2025},
eprint={2502.10362},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2502.10362}
}
```