---
license: mit
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
- yue
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: feature-extraction
tags:
- music
---
# **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**
[Homepage](https://sanderwood.github.io/clamp3/)
[Paper](https://arxiv.org/abs/2502.10362)
[Code](https://github.com/sanderwood/clamp3)
[Demo](https://huggingface.co/spaces/sander-wood/clamp3)
[Model Weights](https://huggingface.co/sander-wood/clamp3/tree/main)
[M4-RAG Dataset](https://huggingface.co/datasets/sander-wood/m4-rag)
[WikiMT-X Benchmark](https://huggingface.co/datasets/sander-wood/wikimt-x)
## **Overview**
CLaMP 3 is a **state-of-the-art** framework for **music information retrieval (MIR)** across multiple **modalities** (**text**, **sheet music**, **audio**, **MIDI**, and **images**) and **languages** (27 trained, 100 supported). It leverages **contrastive learning** to align diverse music modalities into a **shared representation space**, enabling seamless cross-modal retrieval. You can think of it as a more comprehensive version of CLAP or MuLan, with much stronger performance, support for all major music modalities, and global language coverage.
**Why CLaMP 3?**
- **Multimodal**: Works with **text**, **sheet music**, **audio**, **MIDI**, and **images**
- **Multilingual**: Trained on **27 languages** and generalizes to **100 languages**
- **SOTA Performance**: Significantly **outperforms previous strong baselines** across modalities and languages
## **Key Features**
### **Multimodal Support**
- **Sheet Music**: Interleaved ABC notation (**512 bars**)
- **Performance Signals**: MIDI Text Format (**512 MIDI messages**)
- **Audio Recordings**: [MERT](https://arxiv.org/abs/2306.00107) features (**640 sec of audio**)
### **Multilingual Capabilities**
- Trained on **27 languages**, generalizes to **100 languages** using [XLM-R](https://arxiv.org/abs/1911.02116)
### **Visual Semantic Understanding**
- Learns visual semantics (e.g., image captions) for tasks like **image-to-music retrieval**
### **Datasets & Benchmarks**
- **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag)**: **2.31M music-text pairs**
- **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x)**: **1,000 music triplets**
## **What Can CLaMP 3 Do?**
- **Text-to-Music Retrieval**: Search music with text (100 languages!)
- **Image-to-Music Retrieval**: Match music to images
- **Cross-Modal Retrieval**: Find related music across different modalities
- **Zero-Shot Classification**: Identify genre, mood, style, & more
- **Semantic Similarity**: Measure semantic similarity between generated & reference music

**Check it out**: [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/)
## **Quick Start Guide**
To get started quickly with CLaMP 3, follow these steps:
### **Install the Environment**
Run the following commands:
```bash
conda create -n clamp3 python=3.10.16 -y
conda activate clamp3
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements.txt
```
### **Overview of `clamp3_*.py` Scripts**
CLaMP 3 provides scripts for **semantic search**, **semantic similarity calculation**, **retrieval performance evaluation**, and **feature extraction** across five modalities. Simply provide the file path, and the script will automatically detect the modality and extract the relevant features.
Supported formats include:
- **Audio**: `.mp3`, `.wav`
- **Performance Signals**: `.mid`, `.midi`
- **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
- **Images**: `.png`, `.jpg`
- **Text**: `.txt` (in 100 languages)
#### **Feature Management**
- Extracted features are stored in the `cache/` directory and reused in future runs to avoid recomputation.
- Temporary files are saved in `temp/` and cleaned up after each run.
> **Note**: All files in a folder must belong to the same modality for processing.
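
The extension list and the single-modality rule above can be illustrated with a small lookup. This is only a sketch of the idea, not the detection code used by the `clamp3_*.py` scripts; the modality labels and the function names `detect_modality` and `folder_modality` are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Extension-to-modality table copied from the supported formats listed above.
# The label strings on the right are illustrative, not names used by CLaMP 3.
EXTENSION_TO_MODALITY = {
    ".mp3": "audio", ".wav": "audio",
    ".mid": "performance", ".midi": "performance",
    ".mxl": "sheet", ".musicxml": "sheet", ".xml": "sheet",
    ".png": "image", ".jpg": "image",
    ".txt": "text",
}


def detect_modality(path):
    """Return the modality label for one file, or None if the extension is unsupported."""
    return EXTENSION_TO_MODALITY.get(Path(path).suffix.lower())


def folder_modality(paths):
    """Return the single modality shared by all paths; raise if the set is empty or mixed."""
    counts = Counter(detect_modality(p) for p in paths)
    if not counts:
        raise ValueError("no files given")
    if None in counts:
        raise ValueError("unsupported file extension found")
    if len(counts) != 1:
        raise ValueError(f"mixed modalities in one folder: {sorted(counts)}")
    return next(iter(counts))
```

Running a check like `folder_modality` before a long extraction job is one way to catch a stray file of another modality early.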
#### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Semantic Search**
Run retrieval tasks by comparing a query file to reference files in `ref_dir`. The query and `ref_dir` can be **any modality**, so there are **25 possible retrieval combinations**, e.g., text-to-music, image-to-music, music-to-music, music-to-text (zero-shot music classification), etc.
```bash
python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
```
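Conceptually, retrieval in a shared representation space amounts to ranking reference vectors by their similarity to the query vector. The sketch below assumes cosine similarity over global feature vectors (such as the `(1, 768)` outputs produced with `--get_global`); the function name `top_k_by_cosine` is hypothetical, and the actual script may differ in its details.

```python
import numpy as np


def top_k_by_cosine(query, refs, k=10):
    """Rank reference vectors by cosine similarity to a query vector.

    query: array of shape (D,) or (1, D); refs: array of shape (N, D).
    Returns (indices, scores) of the k most similar references, best first.
    """
    q = np.asarray(query, dtype=np.float64).reshape(-1)
    r = np.asarray(refs, dtype=np.float64)
    # Normalize to unit length so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    scores = r @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

Because every modality maps into the same space, the same ranking applies whether the references are audio, sheet music, or text labels, which is how music-to-text search doubles as zero-shot classification.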
#### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**
This script calculates semantic similarity between query and reference files. By default, it uses **pairwise mode**, but you can switch to **group mode** using the `--group` flag.
```bash
python clamp3_score.py <query_dir> <ref_dir> [--group]
```
- **Pairwise Mode (default)**:
Compares files with **matching prefixes** and **identical folder structures**.
**Folder structure example**:
```
query_dir/
├── en/
│   ├── sample1.wav
├── zh/
│   ├── sample1.1.wav
│   ├── sample1.2.wav
│   ├── sample2.wav
ref_dir/
├── en/
│   ├── sample1.txt
├── zh/
│   ├── sample1.txt
│   ├── sample2.txt
```
- Files with the **same prefix** (before the first dot) are treated as pairs (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`).
- Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `ref_dir/zh/sample1.txt`).
**Important**:
- **Pairwise mode** can be **slow** for large datasets.
- If you have a large dataset, **switch to group mode** for faster computation.
- **Group Mode**:
Compares **all query files** to **all reference files** and calculates the average similarity.
**Enable Group Mode**:
```bash
python clamp3_score.py query_dir ref_dir --group
```
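The two modes can be sketched as follows, assuming cosine similarity as the score. Pairing uses the sub-folder plus the file name up to its first dot, as described above; group mode averages the full query-by-reference similarity matrix. All function names here are hypothetical, and this is not the implementation in `clamp3_score.py`.

```python
import numpy as np
from pathlib import Path


def prefix_key(path, root):
    """Key = sub-folder relative to root plus the file name up to its first dot."""
    rel = Path(path).relative_to(root)
    return (rel.parent.as_posix(), rel.name.split(".", 1)[0])


def pair_by_prefix(query_paths, query_root, ref_paths, ref_root):
    """Pair each query file with the reference sharing its sub-folder and prefix."""
    refs = {prefix_key(p, ref_root): p for p in ref_paths}
    pairs = []
    for q in query_paths:
        key = prefix_key(q, query_root)
        if key in refs:  # queries without a matching reference are skipped
            pairs.append((q, refs[key]))
    return pairs


def cosine_matrix(a, b):
    """Cosine similarity between every row of a (Q, D) and every row of b (R, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def group_score(query_feats, ref_feats):
    """Group mode: average similarity over all query-reference combinations."""
    return float(cosine_matrix(query_feats, ref_feats).mean())
```

With the folder example above, `pair_by_prefix` maps both `zh/sample1.1.wav` and `zh/sample1.2.wav` to `zh/sample1.txt`.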
#### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Retrieval Performance Evaluation**
Evaluates **CLaMP 3's retrieval performance** on a paired dataset using metrics like **MRR** and **Hit@K**. It works the same way as **pairwise mode** in `clamp3_score.py`, requiring a **matching folder structure** and **matching filenames** between `query_dir` and `ref_dir`.
```bash
python clamp3_eval.py <query_dir> <ref_dir>
```
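For reference, MRR and Hit@K can be computed from a query-by-reference similarity matrix as in the sketch below. It assumes each query has exactly one correct reference and resolves ties optimistically; the function name `retrieval_metrics` and the default values of K are illustrative, not taken from `clamp3_eval.py`.

```python
import numpy as np


def retrieval_metrics(sim, targets, ks=(1, 10, 100)):
    """Compute MRR and Hit@K.

    sim: (Q, R) similarity matrix; targets[i]: index of the correct
    reference for query i.
    """
    sim = np.asarray(sim, dtype=np.float64)
    targets = np.asarray(targets)
    target_scores = sim[np.arange(len(targets)), targets]
    # Rank = 1 + number of references scoring strictly higher than the correct one.
    ranks = 1 + (sim > target_scores[:, None]).sum(axis=1)
    metrics = {"MRR": float((1.0 / ranks).mean())}
    for k in ks:
        metrics[f"Hit@{k}"] = float((ranks <= k).mean())
    return metrics
```

MRR rewards placing the correct reference near the top of the ranking, while Hit@K only checks whether it appears within the first K results.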
#### **[`clamp3_embd.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_embd.py) - Feature Extraction**
If other scripts don't meet your needs, use `clamp3_embd.py` to extract features.
```bash
python clamp3_embd.py <input_dir> <output_dir> [--get_global]
```
**Feature Output:**
- **Without `--get_global`** → Shape: **(1, T, 768)** (T = time steps). Uses the last hidden states before average pooling, which is ideal for applications that need temporal information. Fine-tuning is recommended.
- **With `--get_global`** → Shape: **(1, 768)**. Uses average-pooled features, which suit applications that need global information and can be used directly.
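The relationship between the two shapes can be sketched with plain NumPy: averaging over the time axis turns `(1, T, 768)` into `(1, 768)`. The optional padding mask is an assumption added for illustration, and whether the script applies masking or any further projection is not stated here, so treat this as a sketch of average pooling rather than a reproduction of `--get_global`.

```python
import numpy as np


def average_pool(hidden, mask=None):
    """Collapse (1, T, D) time-step features to a (1, D) global vector.

    mask: optional (1, T) array with 1 for valid steps and 0 for padding.
    """
    hidden = np.asarray(hidden, dtype=np.float32)
    if mask is None:
        return hidden.mean(axis=1)
    mask = np.asarray(mask, dtype=np.float32)[..., None]  # shape (1, T, 1)
    # Sum only the valid steps, then divide by their count (at least 1).
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1.0, None)
```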
## **Repository Structure**
- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.
- **[inference/](https://github.com/sanderwood/clamp3/tree/main/inference)** → Semantic search, similarity calculations, and retrieval evaluation.
- **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Convert data into Interleaved ABC, MTF, or MERT-extracted features.
> **Note:** Ensure the model weights are placed in the `code/` folder, and verify the configuration hyperparameters before use.
## **Key Script Overview**
### **Data Preparation**
#### **1. Convert Music Data to Compatible Formats**
Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.
##### **1.1 Convert MusicXML to Interleaved ABC Notation**
CLaMP 3 requires **Interleaved ABC notation** for sheet music. Follow these steps:
1. Convert **MusicXML** (`.mxl`, `.xml`, `.musicxml`) to **standard ABC** using [`batch_xml2abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_xml2abc.py):
```bash
python batch_xml2abc.py <input_dir> <output_dir>
```
- **Input:** Directory containing `.mxl`, `.xml`, `.musicxml` files
- **Output:** Directory where converted `.abc` (Standard ABC) files will be saved
2. Convert **Standard ABC** into **Interleaved ABC** using [`batch_interleaved_abc.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/abc/batch_interleaved_abc.py):
```bash
python batch_interleaved_abc.py <input_dir> <output_dir>
```
- **Input:** Directory containing `.abc` (Standard ABC) files
- **Output:** Directory where Interleaved ABC files will be saved *(for CLaMP 3 use)*
##### **1.2 Convert MIDI to MTF Format**
CLaMP 3 processes performance signals in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):
```bash
python batch_midi2mtf.py <input_dir> <output_dir> --m3_compatible
```
- **Input:** Directory containing `.mid`, `.midi` files
- **Output:** Directory where `.mtf` files will be saved *(MTF format for CLaMP 3)*
- **Important:** The `--m3_compatible` flag **must be included** to ensure the output format is compatible with CLaMP 3. Without this flag, the extracted MTF files **will not work** correctly in the pipeline.
##### **1.3 Extract Audio Features using MERT**
For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract MERT-based features from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):
```bash
python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path m-a-p/MERT-v1-95M --mean_features
```
- **Input:** `.mp3`, `.wav`
- **Output:** `.npy` *(Processed audio features for CLaMP 3)*
### **Training and Feature Extraction**
#### **1. Training Models**
The pre-trained CLaMP 3 weights already perform strongly on music retrieval, so in most cases retraining is not needed. If you do need to train on your own data, follow these steps.
1. Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters** and **data paths**.
2. Train on your own data.
To train CLaMP 3 on **symbolic music** (e.g., sheet music, MIDI), run:
```bash
python -m torch.distributed.launch --nproc_per_node=<NUM_GPUS> --use_env train_clamp3_symbolic.py
```
For **audio data**, use:
```bash
python -m torch.distributed.launch --nproc_per_node=<NUM_GPUS> --use_env train_clamp3_audio.py
```
##### **Using Pre-Trained Models (Recommended)**
For most use cases, it's best to use pre-trained weights instead of training from scratch.
| Version | Best for | Download Link |
|---------|---------|--------------|
| **CLaMP 3 SAAS** | **Audio-based retrieval (Recommended)** | [Download SAAS](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |
| **CLaMP 3 C2** | **Symbolic music retrieval (Sheet music, MIDI)** | [Download C2](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |
##### **How to Switch Between Versions?**
By default, CLaMP 3 is configured for the **SAAS version** (optimized for audio).
- If working with **symbolic music (MIDI, sheet music)**, use the **C2 version**:
**Modify line 66 in `config.py`** from `"saas"` to `"c2"`.
#### **2. Feature Extraction**
After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):
```bash
accelerate launch extract_clamp3.py --epoch <epoch> <input_dir> <output_dir> --get_global
```
- **`--epoch <epoch>`:** (Optional) Specify the checkpoint epoch.
- **`<input_dir>`:** Directory containing the input files.
- **`<output_dir>`:** Destination folder for the output `.npy` features.
- **`--get_global`**: **(Required for retrieval!)** Extracts a **global semantic vector** for each input.
All extracted features are stored as `.npy` files.
> **Note**: For retrieval, `--get_global` must be used. Without it, CLaMP 3 will not work correctly for retrieval tasks. Omit `--get_global` only if you are performing downstream fine-tuning or need raw feature extraction for custom tasks.
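
Once global features have been written out, they can be gathered into a single matrix for downstream retrieval or classification. The sketch below assumes each `.npy` file holds one global vector of shape `(1, D)` or `(D,)`; the function name `load_global_features` is hypothetical and not part of the repository.

```python
import numpy as np
from pathlib import Path


def load_global_features(feature_dir):
    """Load every .npy file under feature_dir into one (N, D) matrix.

    Assumes each file holds a single global vector of shape (1, D) or (D,).
    Returns (names, matrix), with files sorted by path for reproducibility.
    """
    paths = sorted(Path(feature_dir).rglob("*.npy"))
    if not paths:
        raise FileNotFoundError(f"no .npy files under {feature_dir}")
    # Flatten each file to 1-D so (1, D) and (D,) inputs stack the same way.
    vectors = [np.load(p).reshape(-1) for p in paths]
    return [p.stem for p in paths], np.stack(vectors)
```

Features extracted without `--get_global` vary in length along the time axis, so stacking them this way would fail; pool or pad them first.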
## **Citation**
If you find CLaMP 3 useful in your work, please consider citing our paper:
```bibtex
@misc{wu2025clamp3universalmusic,
title={CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages},
author={Shangda Wu and Zhancheng Guo and Ruibin Yuan and Junyan Jiang and Seungheon Doh and Gus Xia and Juhan Nam and Xiaobing Li and Feng Yu and Maosong Sun},
year={2025},
eprint={2502.10362},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2502.10362}
}
```