Update README.md
README.md
CHANGED
@@ -116,39 +116,46 @@ tags:
</p>

## **Overview**
-CLaMP 3 is a

-
-
-
-
-- **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.

-
-- Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.

-
-
-

-### **

-

-
-- **
-- **
-- **Zero-Shot Music Classification**: Identifies musical attributes such as genres, moods, and styles without labeled training data.
-- **Music Semantic Similarity Evaluation**: Measures semantic similarity between:
-- **Generated music and its text prompt**, validating how well text-to-music models follow instructions.
-- **Generated music and reference music**, assessing their semantic similarity, including aspects like style, instrumentation, and musicality.

-

## **Quick Start Guide**
-For users who want to get started quickly

-### **Install Environment**
```bash
conda create -n clamp3 python=3.10.16 -y
conda activate clamp3
@@ -157,99 +164,33 @@ pip install -r requirements.txt
```

### **Overview of `clamp3_*.py` Scripts**
-CLaMP 3 provides
-
-**Common Features of `clamp3_*.py` Scripts:**
-- **End-to-End Processing**: Each script handles the entire pipeline in a single command.
-- **Automatic Modality Detection**:
-Simply specify the file path, and the script will automatically detect the modality (e.g., **audio**, **performance signals**, **sheet music**, **images**, or **text**) and extract the relevant features. Supported formats include:
-- **Audio**: `.mp3`, `.wav`
-- **Performance Signals**: `.mid`, `.midi`
-- **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
-- **Images**: `.png`, `.jpg`
-- **Text**: `.txt`
-- **First-Time Model Download**:
-- The necessary model weights for **[CLaMP 3 (SAAS)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth)**, **[MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M)**, and any other required models will be automatically downloaded if needed.
-- Once downloaded, models are cached and will not be re-downloaded in future runs.
-
-- **Feature Management**:
-- Extracted features are saved in `inference/` and **won't be overwritten** to avoid redundant computations.
-- **To run retrieval on a new dataset**, manually delete the corresponding folder inside `inference/` (e.g., `inference/audio_features/`). Otherwise, previously extracted features will be reused.
-- Temporary files are stored in `temp/` and **are cleaned up after each run**.
-
-> **Note**: All files within a folder must belong to the same modality; the script will process them based on the first detected format.
-
-#### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Running Retrieval Tasks**
-
-This script performs semantic retrieval tasks, comparing a query file to reference files in `ref_dir`. Typically, the larger and more diverse the files in `ref_dir`, the better the chances of finding a semantically matching result.
-
-```bash
-python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
-```
-
-- **Text-to-Music Retrieval**:
-Query is a `.txt` file, and `ref_dir` contains music files. Retrieves the music most semantically similar to the query text.
-
-- **Image-to-Music Retrieval**:
-Query is an image (`.png`, `.jpg`), and `ref_dir` contains music files. **BLIP** generates a caption for the image to find the most semantically matching music.

-
-

-
-

-
-You can specify the number of top results to retrieve using the `--top_k` argument. If not provided, the default value is 10.
-Example:
-```bash
-python clamp3_search.py <query_file> <ref_dir> --top_k 3
-```
-
-**Example Output**:
-```
-Top 3 results among 1000 candidates:
-4tDYMayp6Dk 0.7468
-vGJTaP6anOU 0.7333
-JkK8g6FMEXE 0.7054
-```

#### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**

-This script

```bash
-python clamp3_score.py <query_dir> <ref_dir> [--
```

-- **
-Compares
-
-**Example**:
-To compare generated music to ground truth music files (no pairs available), use **group mode**.

-
-python clamp3_score.py query_dir ref_dir
-```
-
-**Example Output (Group Mode)**:
-```
-Total query features: 1000
-Total reference features: 1000
-Group similarity: 0.6711
-```
-
-- **Pairwise Mode**:
-Compares query files with their corresponding reference files based on **same prefix** (before the dot) and **identical folder structure**. **Use when you have paired data** and the dataset is of manageable size (e.g., thousands of pairs).
-
-**Example**:
-To evaluate a **text-to-music generation model**, where each prompt (e.g., `sample1.txt`) corresponds to one or more generated music files (e.g., `sample1.1.wav`, `sample1.2.wav`), use **pairwise mode**.
-
-```bash
-python clamp3_score.py query_dir ref_dir --pairwise
-```
-
-**Folder structure**:
```
query_dir/
├── en/
@@ -267,56 +208,37 @@ python clamp3_score.py <query_dir> <ref_dir> [--pairwise]
│   ├── sample2.txt
```

-- Files with the **same prefix** (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`)
-- Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `

-**
-
-
-Total reference features: 1000
-Avg. pairwise similarity: 0.1639
-```

-
-
-```json
-{"query": "txt_features/UzUybLGvBxE.npy", "reference": "mid_features/UzUybLGvBxE.npy", "similarity": 0.2289600819349289}
-```

-

-#### **[`

-

```bash
-python
```

-
-Requires paired query and reference files, with identical folder structure and filenames between `query_dir` and `ref_dir`. This matches the requirements of **pairwise mode** in `clamp3_score.py`.

-
-The script calculates the following retrieval metrics:
-- **MRR (Mean Reciprocal Rank)**
-- **Hit@1**, **Hit@10**, and **Hit@100**

-
-
-Total query features: 1000
-Total reference features: 1000
-MRR: 0.3301
-Hit@1: 0.251
-Hit@10: 0.482
-Hit@100: 0.796
```

-- **Additional Output**:
-A JSON Lines file (`inference/retrieval_ranks.jsonl`) with query-reference ranks:
-```json
-{"query": "txt_features/HQ9FaXu55l0.npy", "reference": "xml_features/HQ9FaXu55l0.npy", "rank": 6}
-```
-
## **Repository Structure**
- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.

@@ -116,39 +116,46 @@ tags:
</p>

## **Overview**
CLaMP 3 is a **state-of-the-art** framework for **music information retrieval (MIR)** across multiple **modalities** (✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**) and **languages** (🌐 27 trained, 100 supported). It leverages **contrastive learning** to align diverse music formats into a **shared representation space**, enabling seamless cross-modal retrieval. You can think of it as a more comprehensive version of CLAP or MuLan, with stronger performance, support for all major music modalities, and global language coverage.

🚀 **Why CLaMP 3?**
✅ **Multimodal**: Works with ✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**
✅ **Multilingual**: Trained on **27 languages** & generalizes to **100 languages**
✅ **SOTA Performance**: Significantly outperforms previous strong baselines across modalities and languages

## ✨ **Key Features**

### **Multimodal Support**
- **Sheet Music**: Interleaved ABC notation (**512 bars**)
- **Performance Signals**: MIDI Text Format (**512 MIDI messages**)
- **Audio Recordings**: [MERT](https://arxiv.org/abs/2306.00107) features (**640 sec of audio**)

### **Multilingual Capabilities**
- Trained on **27 languages**, generalizes to **100 languages** using [XLM-R](https://arxiv.org/abs/1911.02116)

### **Visual Semantic Understanding**
- Learns visual semantics (e.g., image captions) for tasks like **image-to-music retrieval**

### **Datasets & Benchmarks**
- **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag)**: **2.31M music-text pairs** 🌎
- **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x)**: **1,000 music triplets**

## 🔥 **What Can CLaMP 3 Do?**

💡 **Text-to-Music Retrieval**: Search music with text (100 languages!)
📸 **Image-to-Music Retrieval**: Match music to images 🎨
🔄 **Cross-Modal Retrieval**: Find related music across formats
🛠️ **Zero-Shot Classification**: Identify genre, mood, & style 🏷️
🎼 **Semantic Similarity**: Measure similarity between generated & reference music

👉 **Check it out**: [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/)

## **Quick Start Guide**
To get started quickly with CLaMP 3, follow these steps:

### **Install the Environment**
Run the following commands:

```bash
conda create -n clamp3 python=3.10.16 -y
conda activate clamp3
@@ -157,99 +164,33 @@ pip install -r requirements.txt
```

### **Overview of `clamp3_*.py` Scripts**
CLaMP 3 provides scripts for **semantic similarity calculation**, **semantic search**, and **retrieval performance evaluation** across five modalities. Simply provide the file path, and the script will automatically detect the modality and extract the relevant features.

Supported formats include:
- **Audio**: `.mp3`, `.wav`
- **Performance Signals**: `.mid`, `.midi`
- **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
- **Images**: `.png`, `.jpg`
- **Text**: `.txt` (in 100 languages)
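
As a rough illustration of extension-based modality detection, the list above can be written as a lookup table. This is only a minimal sketch: `EXTENSION_TO_MODALITY` and `detect_modality` are hypothetical names, and the actual scripts may detect modalities differently.

```python
from pathlib import Path

# Extension-to-modality table copied from the list above. The names used
# here are illustrative and are not taken from the CLaMP 3 codebase.
EXTENSION_TO_MODALITY = {
    ".mp3": "audio", ".wav": "audio",
    ".mid": "performance", ".midi": "performance",
    ".mxl": "sheet", ".musicxml": "sheet", ".xml": "sheet",
    ".png": "image", ".jpg": "image",
    ".txt": "text",
}

def detect_modality(path: str) -> str:
    """Return the modality implied by a file's extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_MODALITY:
        raise ValueError(f"Unsupported file type: {path!r}")
    return EXTENSION_TO_MODALITY[suffix]

print(detect_modality("songs/track01.WAV"))      # audio
print(detect_modality("scores/etude.musicxml"))  # sheet
```

A plain dictionary keeps the supported formats in one place, so adding a format would only require a new entry.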

#### **Feature Management**
- Extracted features are stored in the `cache/` directory and reused in future runs to avoid recomputation.
- Temporary files are saved in `temp/` and cleaned up after each run.

> **Note**: All files in a folder must belong to the same modality for processing.
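
The caching behavior described under Feature Management follows a common "compute once, reuse afterwards" pattern. The sketch below shows that pattern in isolation. It is not the scripts' implementation: `cached_features` is a hypothetical helper, the cache file naming is invented, and JSON is used only to keep the example dependency-free.

```python
import json
import tempfile
from pathlib import Path

def cached_features(src: Path, cache_dir: Path, extract) -> list:
    """Return features for `src`, calling `extract` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one JSON file per input file stem.
    cache_file = cache_dir / (src.stem + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    features = extract(src)
    cache_file.write_text(json.dumps(features))
    return features

# Usage: the second call is served from the cache, so the extractor runs once.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "song.wav"
    src.touch()
    calls = []

    def fake_extract(path):
        calls.append(path)          # record every real extraction
        return [0.1, 0.2, 0.3]      # placeholder feature vector

    cached_features(src, Path(tmp) / "cache", fake_extract)
    cached_features(src, Path(tmp) / "cache", fake_extract)
    print(len(calls))  # 1
```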

#### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**

This script calculates semantic similarity between query and reference files. By default, it uses **pairwise mode**, but you can switch to **group mode** using the `--group` flag.

```bash
python clamp3_score.py <query_dir> <ref_dir> [--group]
```

- **Pairwise Mode (default)**:
Compares files with **matching prefixes** and **identical folder structures**.

**Folder structure example**:
```
query_dir/
├── en/
@@ -267,56 +208,37 @@ python clamp3_score.py <query_dir> <ref_dir> [--pairwise]
│   ├── sample2.txt
```

- Files with the **same prefix** (before the first dot) are treated as pairs (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`).
- Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `ref_dir/zh/sample1.txt`).

**Important**:
- **Pairwise mode** can be **slow** for large datasets.
- If you have a large dataset, **switch to group mode** for faster computation.
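
To make the pairing rule concrete, here is a minimal sketch of how query and reference files could be matched by relative folder and by the file name up to its first dot. The helpers `pair_key` and `match_pairs` are hypothetical and only illustrate the rule stated above. They are not taken from `clamp3_score.py`.

```python
from pathlib import Path

def pair_key(root: Path, path: Path) -> str:
    """Relative folder plus the file name up to its first dot."""
    rel = path.relative_to(root)
    prefix = rel.name.split(".", 1)[0]
    return (rel.parent / prefix).as_posix()

def match_pairs(query_dir: str, ref_dir: str) -> list[tuple[str, str]]:
    """Pair every query file with the reference file sharing its key."""
    q_root, r_root = Path(query_dir), Path(ref_dir)
    refs = {pair_key(r_root, p): p for p in sorted(r_root.rglob("*")) if p.is_file()}
    pairs = []
    for q in sorted(q_root.rglob("*")):
        if not q.is_file():
            continue
        ref = refs.get(pair_key(q_root, q))
        if ref is not None:  # query files without a counterpart are skipped
            pairs.append((q.as_posix(), ref.as_posix()))
    return pairs

# Both generated files map to the same key as ref_dir/zh/sample1.txt.
print(pair_key(Path("query_dir"), Path("query_dir/zh/sample1.1.wav")))  # zh/sample1
print(pair_key(Path("query_dir"), Path("query_dir/zh/sample1.2.wav")))  # zh/sample1
```

Because the key keeps the relative folder, `en/sample1` and `zh/sample1` stay distinct, which is why the folder structures have to match.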

- **Group Mode**:
Compares **all query files** to **all reference files** and calculates the average similarity.

**Enable Group Mode**:
```bash
python clamp3_score.py query_dir ref_dir --group
```
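
One way to read "average similarity" is the mean cosine similarity over every query-reference combination. Under that reading (an assumption, since the exact aggregation used by the script is not specified here), the mean over all pairs equals the dot product of the two mean unit vectors, so it can be computed without enumerating pairs. The function names below are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def group_similarity(queries, references):
    """Mean cosine similarity over all query-reference combinations.

    By linearity of the dot product, this equals the dot product of the
    mean unit query vector and the mean unit reference vector.
    """
    q_mean = mean_vector([unit(q) for q in queries])
    r_mean = mean_vector([unit(r) for r in references])
    return sum(a * b for a, b in zip(q_mean, r_mean))

queries = [[1.0, 0.0], [0.0, 1.0]]
references = [[1.0, 0.0]]
print(group_similarity(queries, references))  # 0.5
```

This formulation touches each embedding once, which is consistent with group mode being the faster option for large datasets.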

#### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Semantic Search**

Run retrieval tasks by comparing a query file to reference files in `ref_dir`. The query and `ref_dir` can be **any modality**, so there are **25 possible retrieval combinations**, such as text-to-music, image-to-text, music-to-music, and music-to-text (zero-shot music classification).

```bash
python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
```
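
Conceptually, semantic search ranks every reference embedding by its similarity to the query embedding and keeps the best `TOP_K`. The sketch below assumes cosine similarity and plain Python lists. `top_k` is a hypothetical helper, the default of 10 is borrowed from the earlier README text, and the embeddings are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query, references, k=10):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in references.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy embeddings keyed by file name (invented for illustration).
references = {"a.wav": [1.0, 0.0], "b.wav": [3.0, 4.0], "c.wav": [0.0, 1.0]}
for name, score in top_k([1.0, 0.0], references, k=2):
    print(f"{name} {score:.4f}")
# a.wav 1.0000
# b.wav 0.6000
```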

#### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Retrieval Performance Evaluation**

Evaluates **CLaMP 3's retrieval performance** on a paired dataset using metrics like **MRR** and **Hit@K**. It works the same way as **pairwise mode** in `clamp3_score.py`, requiring **matching folder structure** and **filenames** between `query_dir` and `ref_dir`.

```bash
python clamp3_eval.py <query_dir> <ref_dir>
```
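
For reference, MRR and Hit@K can both be computed from the rank at which each query's paired reference is retrieved. The sketch below uses the standard definitions and assumes 1-based ranks. The function names and the sample ranks are invented for illustration.

```python
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_at_k(ranks, k):
    """Fraction of queries whose paired reference is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Rank of the correct reference for four hypothetical queries (1 = best).
ranks = [1, 2, 4, 20]
print(f"MRR: {mrr(ranks):.4f}")                 # MRR: 0.4500
print(f"Hit@1: {hit_at_k(ranks, 1):.4f}")       # Hit@1: 0.2500
print(f"Hit@10: {hit_at_k(ranks, 10):.4f}")     # Hit@10: 0.7500
print(f"Hit@100: {hit_at_k(ranks, 100):.4f}")   # Hit@100: 1.0000
```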

## **Repository Structure**
- **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
- **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.