sander-wood committed (verified)
Commit 9aebcff · 1 Parent(s): 1c16446

Update README.md

Files changed (1)
  1. README.md +64 -142
README.md CHANGED
@@ -116,39 +116,46 @@ tags:
  </p>

  ## **Overview**
- CLaMP 3 is a multimodal and multilingual framework for music information retrieval (MIR) that supports all major music formats—sheet music, audio, and performance signals—along with multilingual text. It is trained on 27 languages and can generalize to support 100 languages. Using contrastive learning, CLaMP 3 aligns these different formats into a shared representation space, making cross-modal retrieval seamless. Experiments show that it significantly outperforms previous strong baselines, setting a new state-of-the-art in multimodal and multilingual MIR.

- ### **Key Features**
- - **Multimodal Support:**
- - **Sheet Music:** Uses **Interleaved ABC notation**, with a context size of **512 bars**.
- - **Performance Signals:** Processes **MIDI Text Format (MTF)** data, with a context size of **512 MIDI messages**.
- - **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.

- - **Multilingual Capabilities:**
- - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.

- - **Datasets & Benchmarking:**
- - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A web-scale dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
- - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.

- ### **What Can CLaMP 3 Do?**

- CLaMP 3 unifies diverse music data and text into a shared representation space, enabling the following key capabilities:

- - **Text-to-Music Retrieval**: Finds relevant music based on text descriptions in 100 languages.
- - **Image-to-Music Retrieval**: Matches music that aligns with the scene depicted in the image.
- - **Cross-Modal Music Retrieval**: Enables music retrieval and recommendation across different modalities.
- - **Zero-Shot Music Classification**: Identifies musical attributes such as genres, moods, and styles without labeled training data.
- - **Music Semantic Similarity Evaluation**: Measures semantic similarity between:
- - **Generated music and its text prompt**, validating how well text-to-music models follow instructions.
- - **Generated music and reference music**, assessing their semantic similarity, including aspects like style, instrumentation, and musicality.

- For examples demonstrating these capabilities, visit [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/).

  ## **Quick Start Guide**
- For users who want to get started quickly without delving into the details, follow these steps:

- ### **Install Environment**
  ```bash
  conda create -n clamp3 python=3.10.16 -y
  conda activate clamp3
@@ -157,99 +164,33 @@ pip install -r requirements.txt
  ```

  ### **Overview of `clamp3_*.py` Scripts**
- CLaMP 3 provides the `clamp3_*.py` script series for **streamlined data preprocessing, feature extraction, retrieval, similarity computation, and evaluation**. These scripts offer an easy-to-use solution for processing different modalities with minimal configuration.
-
- **Common Features of `clamp3_*.py` Scripts:**
- - **End-to-End Processing**: Each script handles the entire pipeline in a single command.
- - **Automatic Modality Detection**:
- Simply specify the file path, and the script will automatically detect the modality (e.g., **audio**, **performance signals**, **sheet music**, **images**, or **text**) and extract the relevant features. Supported formats include:
- - **Audio**: `.mp3`, `.wav`
- - **Performance Signals**: `.mid`, `.midi`
- - **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
- - **Images**: `.png`, `.jpg`
- - **Text**: `.txt`
- - **First-Time Model Download**:
- - The necessary model weights for **[CLaMP 3 (SAAS)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth)**, **[MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M)**, and any other required models will be automatically downloaded if needed.
- - Once downloaded, models are cached and will not be re-downloaded in future runs.
-
- - **Feature Management**:
- - Extracted features are saved in `inference/` and **won't be overwritten** to avoid redundant computations.
- - **To run retrieval on a new dataset**, manually delete the corresponding folder inside `inference/` (e.g., `inference/audio_features/`). Otherwise, previously extracted features will be reused.
- - Temporary files are stored in `temp/` and **are cleaned up after each run**.
-
- > **Note**: All files within a folder must belong to the same modality; the script will process them based on the first detected format.
-
- #### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Running Retrieval Tasks**
-
- This script performs semantic retrieval tasks, comparing a query file to reference files in `ref_dir`. Typically, the larger and more diverse the files in `ref_dir`, the better the chances of finding a semantically matching result.
-
- ```bash
- python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
- ```
-
- - **Text-to-Music Retrieval**:
- Query is a `.txt` file, and `ref_dir` contains music files. Retrieves the music most semantically similar to the query text.
-
- - **Image-to-Music Retrieval**:
- Query is an image (`.png`, `.jpg`), and `ref_dir` contains music files. **BLIP** generates a caption for the image to find the most semantically matching music.

- - **Music-to-Music Retrieval**:
- Query is a music file, and `ref_dir` contains music files (same or different modality). Supports **cross-modal retrieval** (e.g., retrieving audio using sheet music).

- - **Zero-Shot Classification**:
- Query is a music file, and `ref_dir` contains **text-based class prototypes** (e.g., `"It is classical"`, `"It is jazz"`). The highest similarity match is the classification result.

- - **Optional `--top_k` Parameter**:
- You can specify the number of top results to retrieve using the `--top_k` argument. If not provided, the default value is 10.
- Example:
- ```bash
- python clamp3_search.py <query_file> <ref_dir> --top_k 3
- ```
-
- **Example Output**:
- ```
- Top 3 results among 1000 candidates:
- 4tDYMayp6Dk 0.7468
- vGJTaP6anOU 0.7333
- JkK8g6FMEXE 0.7054
- ```

  #### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**

- This script compares files in a query directory to a reference directory. By default, it uses **group mode**, but you can switch to **pairwise mode** for paired data.

  ```bash
- python clamp3_score.py <query_dir> <ref_dir> [--pairwise]
  ```

- - **Group Mode (default)**:
- Compares all query files to all reference files and calculates the average similarity. **Use when you don't have paired data** or when dealing with large datasets.
-
- **Example**:
- To compare generated music to ground truth music files (no pairs available), use **group mode**.

- ```bash
- python clamp3_score.py query_dir ref_dir
- ```
-
- **Example Output (Group Mode)**:
- ```
- Total query features: 1000
- Total reference features: 1000
- Group similarity: 0.6711
- ```
-
- - **Pairwise Mode**:
- Compares query files with their corresponding reference files based on **same prefix** (before the dot) and **identical folder structure**. **Use when you have paired data** and the dataset is of manageable size (e.g., thousands of pairs).
-
- **Example**:
- To evaluate a **text-to-music generation model**, where each prompt (e.g., `sample1.txt`) corresponds to one or more generated music files (e.g., `sample1.1.wav`, `sample1.2.wav`), use **pairwise mode**.
-
- ```bash
- python clamp3_score.py query_dir ref_dir --pairwise
- ```
-
- **Folder structure**:
  ```
  query_dir/
  ├── en/
@@ -267,56 +208,37 @@ python clamp3_score.py <query_dir> <ref_dir> [--pairwise]
  │ ├── sample2.txt
  ```

- - Files with the **same prefix** (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`) are treated as pairs.
- - Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `query_dir/zh/sample1.txt`).

- **Example Output (Pairwise Mode)**:
- ```
- Total query features: 1000
- Total reference features: 1000
- Avg. pairwise similarity: 0.1639
- ```

- In **pairwise mode**, the script will additionally output a JSON Lines file (`inference/pairwise_similarities.jsonl`) with the similarity scores for each query-reference pair.
- For example:
- ```json
- {"query": "txt_features/UzUybLGvBxE.npy", "reference": "mid_features/UzUybLGvBxE.npy", "similarity": 0.2289600819349289}
- ```

- > **Note**: The file paths in the output will retain the folder structure and file names, but the top-level folder names and file extensions will be replaced.

- #### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Evaluating Retrieval Performance**

- This script evaluates **CLaMP3's retrieval performance on a paired dataset**, measuring how accurately the system ranks the correct reference files for each query using metrics like **MRR** and **Hit@K**.

  ```bash
- python clamp3_eval.py <query_dir> <ref_dir>
  ```

- - **Matching Folder Structure & Filenames**:
- Requires paired query and reference files, with identical folder structure and filenames between `query_dir` and `ref_dir`. This matches the requirements of **pairwise mode** in `clamp3_score.py`.

- - **Evaluation Metrics**:
- The script calculates the following retrieval metrics:
- - **MRR (Mean Reciprocal Rank)**
- - **Hit@1**, **Hit@10**, and **Hit@100**

- **Example Output**:
- ```
- Total query features: 1000
- Total reference features: 1000
- MRR: 0.3301
- Hit@1: 0.251
- Hit@10: 0.482
- Hit@100: 0.796
  ```

- - **Additional Output**:
- A JSON Lines file (`inference/retrieval_ranks.jsonl`) with query-reference ranks:
- ```json
- {"query": "txt_features/HQ9FaXu55l0.npy", "reference": "xml_features/HQ9FaXu55l0.npy", "rank": 6}
- ```
-
  ## **Repository Structure**
  - **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
  - **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.
 
  </p>

  ## **Overview**
+ CLaMP 3 is a **state-of-the-art** framework for **music information retrieval (MIR)** across multiple **modalities** (✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**) and **languages** (🌐 27 trained, 100 supported). It leverages **contrastive learning** to align diverse music formats into a **shared representation space**, enabling seamless cross-modal retrieval. You can think of it as a more comprehensive version of CLAP or MuLan—with stronger performance, support for all major music modalities, and global language coverage.

+ 🚀 **Why CLaMP 3?**
+ **Multimodal**: Works with ✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**
+ **Multilingual**: Supports **27 trained** & generalizes to **100 languages**
+ **SOTA Performance**: Significantly outperforms previous strong baselines across modalities and languages

+ ## **Key Features**

+ ### **Multimodal Support**
+ - **Sheet Music**: Interleaved ABC notation (**512 bars**)
+ - **Performance Signals**: MIDI Text Format (**512 MIDI messages**)
+ - **Audio Recordings**: [MERT](https://arxiv.org/abs/2306.00107) features (**640 sec of audio**)

+ ### **Multilingual Capabilities**
+ - Trained on **27 languages**, generalizes to **100 languages** using [XLM-R](https://arxiv.org/abs/1911.02116)

+ ### **Visual Semantic Understanding**
+ - Learns visual semantics (e.g., image captions) for tasks like **image-to-music retrieval**

+ ### **Datasets & Benchmarks**
+ - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag)**: **2.31M music-text pairs** 🌎
+ - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x)**: **1,000 music triplets**

+ ## 🔥 **What Can CLaMP 3 Do?**
+
+ 💡 **Text-to-Music Retrieval**: Search music with text (100 languages!)
+ 📸 **Image-to-Music Retrieval**: Match music to images 🎨
+ 🔄 **Cross-Modal Retrieval**: Find related music across formats
+ 🛠️ **Zero-Shot Classification**: Identify genre, mood, & style 🏷️
+ 🎼 **Semantic Similarity**: Measure similarity between generated & reference music
+
+ 👉 **Check it out**: [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/)

  ## **Quick Start Guide**
+ For users who want to get started quickly with CLaMP3, follow these steps:
+
+ ### **Install the Environment**
+ Run the following commands:

  ```bash
  conda create -n clamp3 python=3.10.16 -y
  conda activate clamp3

  ```
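
The install commands assume you are working inside a local clone of the repository, since dependencies are installed with `pip install -r requirements.txt`. A minimal sketch, using the repository URL from the script links in this README:

```bash
# Hypothetical setup step: fetch the code before creating the conda environment.
git clone https://github.com/sanderwood/clamp3.git
cd clamp3
```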

  ### **Overview of `clamp3_*.py` Scripts**
+ CLaMP 3 provides scripts for **semantic similarity calculation**, **semantic search**, and **retrieval performance evaluation** across five modalities. Simply provide the file path, and the script will automatically detect the modality and extract the relevant features.
+
+ Supported formats include:
+ - **Audio**: `.mp3`, `.wav`
+ - **Performance Signals**: `.mid`, `.midi`
+ - **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
+ - **Images**: `.png`, `.jpg`
+ - **Text**: `.txt` (in 100 languages)
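
Because the modality is detected automatically from the file you pass, the same command shape covers every query type. A small sketch using the documented `clamp3_search.py` usage (file and folder names below are placeholders):

```bash
# Same script, different query modalities - the script detects each from the file provided.
python clamp3_search.py query.txt ref_music/   # text query (text-to-music)
python clamp3_search.py query.jpg ref_music/   # image query (image-to-music)
python clamp3_search.py query.mid ref_music/   # MIDI query (music-to-music)
```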

+ #### **Feature Management**
+ - Extracted features are stored in the `cache/` directory and reused in future runs to avoid recomputation.
+ - Temporary files are saved in `temp/` and cleaned up after each run.
+
+ > **Note**: All files in a folder must belong to the same modality for processing.
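
If you swap in a new dataset and want the features recomputed, clearing the cache is the simplest option; a sketch, assuming all cached features live under `cache/` as described above:

```bash
# Remove previously extracted features so the next run re-extracts them.
rm -r cache/
```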

  #### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**

+ This script calculates semantic similarity between query and reference files. By default, it uses **pairwise mode**, but you can switch to **group mode** using the `--group` flag.

  ```bash
+ python clamp3_score.py <query_dir> <ref_dir> [--group]
  ```

+ - **Pairwise Mode (default)**:
+ Compares files with **matching prefixes** and **identical folder structures**.
+
+ **Folder structure example**:
  ```
  query_dir/
  ├── en/

  │ ├── sample2.txt
  ```

+ - Files with the **same prefix** (before the first dot) are treated as pairs (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`).
+ - Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `ref_dir/zh/sample1.txt`).
+
+ **Important**:
+ - **Pairwise mode** can be **slow** for large datasets.
+ - If you have a large dataset, **switch to group mode** for faster computation.
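
For example, to score a text-to-music system in the default pairwise mode, the generated audio and the corresponding prompts only need matching prefixes and mirrored folders (directory names below are placeholders):

```bash
# Pairwise mode is the default: each query file is scored against the
# reference file that shares its prefix and relative path.
python clamp3_score.py generated_audio/ prompts/
```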

+ - **Group Mode**:
+ Compares **all query files** to **all reference files** and calculates the average similarity.
+
+ **Enable Group Mode**:
+ ```bash
+ python clamp3_score.py query_dir ref_dir --group
+ ```

+ #### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Semantic Search**

+ Run retrieval tasks by comparing a query file to reference files in `ref_dir`. The query and `ref_dir` can be **any modality**, so there are **25 possible retrieval combinations**, e.g., text-to-music, image-to-text, music-to-music, music-to-text (zero-shot music classification), etc.

  ```bash
+ python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
  ```
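
The optional `--top_k` argument caps the number of returned matches; a quick example with placeholder names, asking for the three best results:

```bash
# Return the three most similar reference items for a text query.
python clamp3_search.py query.txt ref_music/ --top_k 3
```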

+ #### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Retrieval Performance Evaluation**

+ Evaluates **CLaMP3's retrieval performance** on a paired dataset using metrics like **MRR** and **Hit@K**. Works the same way as **pairwise mode** in `clamp3_score.py`—requiring **matching folder structure** and **filenames** between `query_dir` and `ref_dir`.

+ ```bash
+ python clamp3_eval.py <query_dir> <ref_dir>
  ```
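
The paired layout is the same one pairwise scoring expects; a sketch with placeholder directory names:

```bash
# query_dir and ref_dir must mirror each other's folder structure and filenames
# (e.g., query_dir/en/sample1.wav pairs with ref_dir/en/sample1.txt).
python clamp3_eval.py query_dir ref_dir
```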

  ## **Repository Structure**
  - **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
  - **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.