Feature Extraction
music
File size: 14,993 Bytes
50feb15
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f94d6d5
961d3b5
b27ec7f
7566c28
bf2749e
7ae264c
eb3ffc5
 
50feb15
 
 
 
 
f94d6d5
17a69b8
50feb15
9aebcff
 
f319dd8
 
50feb15
9aebcff
50feb15
9aebcff
 
 
 
50feb15
9aebcff
 
50feb15
9aebcff
 
b27ec7f
9aebcff
 
 
b27ec7f
9aebcff
 
 
 
17a69b8
 
 
9aebcff
 
1458565
1c16446
9aebcff
 
 
 
50feb15
f94d6d5
1c16446
f94d6d5
1c16446
 
f94d6d5
50feb15
1c16446
355625c
50feb15
9aebcff
 
 
 
 
f319dd8
1c16446
9aebcff
 
 
1c16446
9aebcff
1c16446
355625c
 
 
 
 
 
 
 
1c16446
 
9aebcff
50feb15
 
9aebcff
1c16446
 
9aebcff
 
1c16446
9aebcff
1c16446
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9aebcff
 
1c16446
9aebcff
 
 
1c16446
9aebcff
 
1c16446
9aebcff
 
 
 
1c16446
355625c
1c16446
355625c
1c16446
 
355625c
1c16446
 
355625c
1c16446
355625c
1c16446
9aebcff
355625c
1c16446
 
355625c
 
 
 
1c16446
 
 
 
 
 
 
 
 
 
 
 
 
 
f94d6d5
1c16446
f94d6d5
1c16446
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f94d6d5
1c16446
f94d6d5
1c16446
 
 
f94d6d5
 
b27ec7f
f94d6d5
 
a10df5a
f94d6d5
 
 
 
b27ec7f
 
 
 
f94d6d5
b27ec7f
f94d6d5
b27ec7f
 
 
f94d6d5
 
 
b27ec7f
f94d6d5
 
 
 
b27ec7f
 
f94d6d5
b27ec7f
 
 
 
f94d6d5
b27ec7f
 
 
 
 
f94d6d5
 
 
 
1c16446
f94d6d5
f9ca50e
 
 
 
f94d6d5
 
 
f9ca50e
f94d6d5
 
b27ec7f
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
---
license: mit
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh  
- yue  
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: feature-extraction
tags:
- music
---
# **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**
[![Homepage](https://img.shields.io/badge/CLaMP%203%20Homepage-GitHub-181717?style=for-the-badge&logo=home-assistant)](https://sanderwood.github.io/clamp3/)
[![Paper](https://img.shields.io/badge/CLaMP%203%20Paper-Arxiv-red?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2502.10362)
[![GitHub](https://img.shields.io/badge/CLaMP%203%20Code-GitHub-181717?style=for-the-badge&logo=github)](https://github.com/sanderwood/clamp3)
[![Demo](https://img.shields.io/badge/CLaMP%203%20Demo-Gradio-green?style=for-the-badge&logo=gradio)](https://huggingface.co/spaces/sander-wood/clamp3)
[![Hugging Face](https://img.shields.io/badge/Model%20Weights-Hugging%20Face-ffcc00?style=for-the-badge&logo=huggingface)](https://huggingface.co/sander-wood/clamp3/tree/main)
[![Dataset](https://img.shields.io/badge/M4--RAG%20Dataset-Hugging%20Face-ffcc00?style=for-the-badge&logo=huggingface)](https://huggingface.co/datasets/sander-wood/m4-rag)
[![Benchmark](https://img.shields.io/badge/WikiMT--X%20Benchmark-Hugging%20Face-ffcc00?style=for-the-badge&logo=huggingface)](https://huggingface.co/datasets/sander-wood/wikimt-x)

<p align="center">
  <img src="overview.png" alt="CLaMP 3 Overview" width="50%">
</p>

## **Overview**
CLaMP 3 is a **state-of-the-art** framework for **music information retrieval (MIR)** across multiple **modalities** (✍️ **text**, 🎼 **sheet music**, 🎡 **audio**, 🎹 **MIDI**, and πŸ–ΌοΈ **images**) and **languages** (🌐 27 trained, 100 supported). It leverages **contrastive learning** to align diverse music modalities into a **shared representation space**, enabling seamless cross-modal retrieval. You can think of it as a more comprehensive version of CLAP or MuLanβ€”with much stronger performance, support for all major music modalities, and global language coverage.

πŸš€ **Why CLaMP 3?**  
βœ… **Multimodal**: Works with ✍️ **text**, 🎼 **sheet music**, 🎡 **audio**, 🎹 **MIDI**, and πŸ–ΌοΈ **images**  
βœ… **Multilingual**: Supports **27 trained** & generalizes to 🌐 **100 languages**  
βœ… **SOTA Performance**: Significantly **outperforms previous strong baselines** across modalities and languages πŸ“Š

## ✨ **Key Features**  

### **Multimodal Support**  
- **Sheet Music**: Interleaved ABC notation (**512 bars**)  
- **Performance Signals**: MIDI Text Format (**512 MIDI messages**)  
- **Audio Recordings**: [MERT](https://arxiv.org/abs/2306.00107) features (**640 sec of audio**)  

### **Multilingual Capabilities**  
- Trained on **27 languages**, generalizes to **100 languages** using [XLM-R](https://arxiv.org/abs/1911.02116)  

### **Visual Semantic Understanding**  
- Learns visual semantics (e.g., image captions) for tasks like **image-to-music retrieval** 

### **Datasets & Benchmarks**  
- **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag)**: **2.31M music-text pairs** 🌎  
- **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x)**: **1,000 music triplets**  

## πŸ”₯ **What Can CLaMP 3 Do?**  

πŸ’‘ **Text-to-Music Retrieval**: Search music with text (100 languages!)  
πŸ“Έ **Image-to-Music Retrieval**: Match music to images 🎨  
πŸ”„ **Cross-Modal Retrieval**: Find related music across different modalities  
πŸ› οΈ **Zero-Shot Classification**: Identify genre, mood, style, & more 🏷️  
🎼 **Semantic Similarity**: Measure semantic similarity between generated & reference music  

πŸ‘‰ **Check it out**: [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/)  

## **Quick Start Guide**  
For users who want to get started quickly with CLaMP3, follow these steps:

### **Install the Environment**  
Run the following commands:

```bash
conda create -n clamp3 python=3.10.16 -y
conda activate clamp3
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements.txt
```

### **Overview of `clamp3_*.py` Scripts**  
CLaMP 3 provides scripts for **semantic search**, **semantic similarity calculation**, **retrieval performance evaluation**, and **feature extraction** across five modalities. Simply provide the file path, and the script will automatically detect the modality and extract the relevant features.

Supported formats include:
- **Audio**: `.mp3`, `.wav`
- **Performance Signals**: `.mid`, `.midi`
- **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
- **Images**: `.png`, `.jpg`
- **Text**: `.txt` (in 100 languages)

#### **Feature Management**  
- Extracted features are stored in the `cache/` directory and reused in future runs to avoid recomputation.
- Temporary files are saved in `temp/` and cleaned up after each run.

> **Note**: All files in a folder must belong to the same modality for processing.

#### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Semantic Search**  

Run retrieval tasks by comparing a query file to reference files in `ref_dir`. The query and `ref_dir` can be **any modality**, so there are **25 possible retrieval combinations**, e.g., text-to-music, image-to-music, music-to-music, music-to-text (zero-shot music classification), etc.

```bash
python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
```

#### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**

This script calculates semantic similarity between query and reference files. By default, it uses **pairwise mode**, but you can switch to **group mode** using the `--group` flag.

```bash
python clamp3_score.py <query_dir> <ref_dir> [--group]
```

- **Pairwise Mode (default)**:  
  Compares files with **matching prefixes** and **identical folder structures**.

  **Folder structure example**:
  ```
  query_dir/
  β”œβ”€β”€ en/
  β”‚   β”œβ”€β”€ sample1.wav  
  β”œβ”€β”€ zh/
  β”‚   β”œβ”€β”€ sample1.1.wav   
  β”‚   β”œβ”€β”€ sample1.2.wav    
  β”‚   β”œβ”€β”€ sample2.wav  

  ref_dir/
  β”œβ”€β”€ en/
  β”‚   β”œβ”€β”€ sample1.txt
  β”œβ”€β”€ zh/
  β”‚   β”œβ”€β”€ sample1.txt
  β”‚   β”œβ”€β”€ sample2.txt
  ```

  - Files with the **same prefix** (before the first dot) are treated as pairs (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`).
  - Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `ref_dir/zh/sample1.txt`).

  **Important**:  
  - **Pairwise mode** can be **slow** for large datasets.  
  - If you have a large dataset, **switch to group mode** for faster computation.

- **Group Mode**:  
  Compares **all query files** to **all reference files** and calculates the average similarity.

  **Enable Group Mode**:
  ```bash
  python clamp3_score.py query_dir ref_dir --group
  ```

#### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Retrieval Performance Evaluation**  

Evaluates **CLaMP3's retrieval performance** on a paired dataset using metrics like **MRR** and **Hit@K**. Works the same way as **pairwise mode** in `clamp3_score.py`β€”requiring **matching folder structure** and **filenames** between `query_dir` and `ref_dir`.

```bash
python clamp3_eval.py <query_dir> <ref_dir>
```

#### **[`clamp3_embd.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_embd.py) - Feature Extraction**  

If other scripts don't meet your needs, use `clamp3_embd.py` to extract features.

```bash
python clamp3_embd.py <input_dir_path> <output_dir_path> [--get_global]
```

**Feature Output:**
  - **Without `--get_global`** β†’ Shape: **(1, T, 768)** (T = time steps). Uses last hidden states before avg pooling, ideal for applications needing temporal info. Fine-tuning recommended.
  - **With `--get_global`** β†’ Shape: **(1, 768)**. Uses avg pooled features, suitable for applications needing global info, can be used directly.

## **Repository Structure**
- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** β†’ Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** β†’ Linear classification training and prediction.  
- **[inference/](https://github.com/sanderwood/clamp3/tree/main/inference)** β†’ Semantic search, similarity calculations, and retrieval evaluation.  
- **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** β†’ Convert data into Interleaved ABC, MTF, or MERT-extracted features.  

> **Note:** Ensure the model weights are placed in the `code/` folder, and verify the configuration hyperparameters before use.

## **Key Script Overview**
### **Data Preparation**
#### **1. Convert Music Data to Compatible Formats**
Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.

##### **1.1 Convert MusicXML to Interleaved ABC Notation**  

CLaMP 3 requires **Interleaved ABC notation** for sheet music. Follow these steps:

1. Convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):  
   ```bash
   python batch_xml2abc.py <input_dir> <output_dir>
   ```
   - **Input:** Directory containing `.mxl`, `.xml`, `.musicxml` files  
   - **Output:** Directory where converted `.abc` (Standard ABC) files will be saved  

2. Convert **Standard ABC** into **Interleaved ABC** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):  
   ```bash
   python batch_interleaved_abc.py <input_dir> <output_dir>
   ```
   - **Input:** Directory containing `.abc` (Standard ABC) files  
   - **Output:** Directory where Interleaved ABC files will be saved *(for CLaMP 3 use)*  

##### **1.2 Convert MIDI to MTF Format**  

CLaMP 3 processes performance signals in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):  
```bash
python batch_midi2mtf.py <input_dir> <output_dir> --m3_compatible
```
- **Input:** Directory containing `.mid`, `.midi` files  
- **Output:** Directory where `.mtf` files will be saved *(MTF format for CLaMP 3)*  
- **Important:** The `--m3_compatible` flag **must be included** to ensure the output format is compatible with CLaMP 3. Without this flag, the extracted MTF files **will not work** correctly in the pipeline.

##### **1.3 Extract Audio Features using MERT**
For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract MERT-based features from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):

```bash
python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path m-a-p/MERT-v1-95M --mean_features
```
- **Input:** `.mp3`, `.wav`  
- **Output:** `.npy` *(Processed audio features for CLaMP 3)*  

### **Training and Feature Extraction**  

#### **1. Training Models**  
CLaMP 3 is the most powerful music retrieval model, and in most cases, retraining is not needed. However, if necessary, follow these steps.  

1. Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters** and **data paths**.  

2. Train on your own data.

To train CLaMP 3 on **symbolic music** (e.g., sheet music, MIDI), run:  
```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
```
For **audio data**, use:  
```bash
python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
```

##### **Using Pre-Trained Models (Recommended)**  
For most use cases, it's best to use pre-trained weights instead of training from scratch.  

| Version | Best for | Download Link |
|---------|---------|--------------|
| **CLaMP 3 SAAS** | **Audio-based retrieval (Recommended)** | [Download SAAS](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |
| **CLaMP 3 C2** | **Symbolic music retrieval (Sheet music, MIDI)** | [Download C2](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |

##### **How to Switch Between Versions?**  
By default, CLaMP 3 is configured for the **SAAS version** (optimized for audio).  
- If working with **symbolic music (MIDI, sheet music)**, use the **C2 version**:  
  **Modify line 66 in `config.py`** from `"saas"` to `"c2"`.
  
#### **2. Feature Extraction**
After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):

```bash
accelerate launch extract_clamp3.py --epoch <epoch> <input_dir> <output_dir> --get_global
```
- **`--epoch <epoch>`:** (Optional) Specify the checkpoint epoch.  
- **`<input_dir>`:** Directory containing the input files.  
- **`<output_dir>`:** Destination folder for the output `.npy` features.  
- **`--get_global`**: **(Required for retrieval!)** Extracts a **global semantic vector** for each input.  

All extracted features are stored as `.npy` files.

> **Note**: For retrieval, `--get_global` must be used. Without it, CLaMP 3 will not work correctly for retrieval tasks. You only omit `--get_global` if you are performing downstream fine-tuning or need raw feature extraction for custom tasks.

## **Citation**
If you find CLaMP 3 useful in your work, please consider citing our paper:

```bibtex
@misc{wu2025clamp3universalmusic,
  title={CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages}, 
  author={Shangda Wu and Zhancheng Guo and Ruibin Yuan and Junyan Jiang and Seungheon Doh and Gus Xia and Juhan Nam and Xiaobing Li and Feng Yu and Maosong Sun},
  year={2025},
  eprint={2502.10362},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2502.10362}
}
```