sander-wood committed (verified)
Commit 9aebcff · 1 Parent(s): 1c16446

Update README.md

Files changed (1)
  1. README.md +64 -142
README.md CHANGED
@@ -116,39 +116,46 @@ tags:
  </p>

  ## **Overview**
- CLaMP 3 is a multimodal and multilingual framework for music information retrieval (MIR) that supports all major music formats—sheet music, audio, and performance signals—along with multilingual text. It is trained on 27 languages and can generalize to support 100 languages. Using contrastive learning, CLaMP 3 aligns these different formats into a shared representation space, making cross-modal retrieval seamless. Experiments show that it significantly outperforms previous strong baselines, setting a new state-of-the-art in multimodal and multilingual MIR.

- ### **Key Features**
- - **Multimodal Support:**
- - **Sheet Music:** Uses **Interleaved ABC notation**, with a context size of **512 bars**.
- - **Performance Signals:** Processes **MIDI Text Format (MTF)** data, with a context size of **512 MIDI messages**.
- - **Audio Recordings:** Works with features extracted by **[MERT](https://arxiv.org/abs/2306.00107)**, with a context size of **640 seconds of audio**.

- - **Multilingual Capabilities:**
- - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.

- - **Datasets & Benchmarking:**
- - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A web-scale dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
- - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.

- ### **What Can CLaMP 3 Do?**

- CLaMP 3 unifies diverse music data and text into a shared representation space, enabling the following key capabilities:

- - **Text-to-Music Retrieval**: Finds relevant music based on text descriptions in 100 languages.
- - **Image-to-Music Retrieval**: Matches music that aligns with the scene depicted in the image.
- - **Cross-Modal Music Retrieval**: Enables music retrieval and recommendation across different modalities.
- - **Zero-Shot Music Classification**: Identifies musical attributes such as genres, moods, and styles without labeled training data.
- - **Music Semantic Similarity Evaluation**: Measures semantic similarity between:
- - **Generated music and its text prompt**, validating how well text-to-music models follow instructions.
- - **Generated music and reference music**, assessing their semantic similarity, including aspects like style, instrumentation, and musicality.

- For examples demonstrating these capabilities, visit [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/).

  ## **Quick Start Guide**
- For users who want to get started quickly without delving into the details, follow these steps:

- ### **Install Environment**
  ```bash
  conda create -n clamp3 python=3.10.16 -y
  conda activate clamp3
@@ -157,99 +164,33 @@ pip install -r requirements.txt
  ```

  ### **Overview of `clamp3_*.py` Scripts**
- CLaMP 3 provides the `clamp3_*.py` script series for **streamlined data preprocessing, feature extraction, retrieval, similarity computation, and evaluation**. These scripts offer an easy-to-use solution for processing different modalities with minimal configuration.
-
- **Common Features of `clamp3_*.py` Scripts:**
- - **End-to-End Processing**: Each script handles the entire pipeline in a single command.
- - **Automatic Modality Detection**:
- Simply specify the file path, and the script will automatically detect the modality (e.g., **audio**, **performance signals**, **sheet music**, **images**, or **text**) and extract the relevant features. Supported formats include:
- - **Audio**: `.mp3`, `.wav`
- - **Performance Signals**: `.mid`, `.midi`
- - **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
- - **Images**: `.png`, `.jpg`
- - **Text**: `.txt`
- - **First-Time Model Download**:
- - The necessary model weights for **[CLaMP 3 (SAAS)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth)**, **[MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M)**, and any other required models will be automatically downloaded if needed.
- - Once downloaded, models are cached and will not be re-downloaded in future runs.
-
- - **Feature Management**:
- - Extracted features are saved in `inference/` and **won't be overwritten** to avoid redundant computations.
- - **To run retrieval on a new dataset**, manually delete the corresponding folder inside `inference/` (e.g., `inference/audio_features/`). Otherwise, previously extracted features will be reused.
- - Temporary files are stored in `temp/` and **are cleaned up after each run**.
-
- > **Note**: All files within a folder must belong to the same modality; the script will process them based on the first detected format.
-
- #### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Running Retrieval Tasks**
-
- This script performs semantic retrieval tasks, comparing a query file to reference files in `ref_dir`. Typically, the larger and more diverse the files in `ref_dir`, the better the chances of finding a semantically matching result.
-
- ```bash
- python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
- ```
-
- - **Text-to-Music Retrieval**:
- Query is a `.txt` file, and `ref_dir` contains music files. Retrieves the music most semantically similar to the query text.
-
- - **Image-to-Music Retrieval**:
- Query is an image (`.png`, `.jpg`), and `ref_dir` contains music files. **BLIP** generates a caption for the image to find the most semantically matching music.

- - **Music-to-Music Retrieval**:
- Query is a music file, and `ref_dir` contains music files (same or different modality). Supports **cross-modal retrieval** (e.g., retrieving audio using sheet music).

- - **Zero-Shot Classification**:
- Query is a music file, and `ref_dir` contains **text-based class prototypes** (e.g., `"It is classical"`, `"It is jazz"`). The highest similarity match is the classification result.

- - **Optional `--top_k` Parameter**:
- You can specify the number of top results to retrieve using the `--top_k` argument. If not provided, the default value is 10.
- Example:
- ```bash
- python clamp3_search.py <query_file> <ref_dir> --top_k 3
- ```
-
- **Example Output**:
- ```
- Top 3 results among 1000 candidates:
- 4tDYMayp6Dk 0.7468
- vGJTaP6anOU 0.7333
- JkK8g6FMEXE 0.7054
- ```

  #### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**

- This script compares files in a query directory to a reference directory. By default, it uses **group mode**, but you can switch to **pairwise mode** for paired data.

  ```bash
- python clamp3_score.py <query_dir> <ref_dir> [--pairwise]
  ```

- - **Group Mode (default)**:
- Compares all query files to all reference files and calculates the average similarity. **Use when you don't have paired data** or when dealing with large datasets.
-
- **Example**:
- To compare generated music to ground truth music files (no pairs available), use **group mode**.

- ```bash
- python clamp3_score.py query_dir ref_dir
- ```
-
- **Example Output (Group Mode)**:
- ```
- Total query features: 1000
- Total reference features: 1000
- Group similarity: 0.6711
- ```
-
- - **Pairwise Mode**:
- Compares query files with their corresponding reference files based on **same prefix** (before the dot) and **identical folder structure**. **Use when you have paired data** and the dataset is of manageable size (e.g., thousands of pairs).
-
- **Example**:
- To evaluate a **text-to-music generation model**, where each prompt (e.g., `sample1.txt`) corresponds to one or more generated music files (e.g., `sample1.1.wav`, `sample1.2.wav`), use **pairwise mode**.
-
- ```bash
- python clamp3_score.py query_dir ref_dir --pairwise
- ```
-
- **Folder structure**:
  ```
  query_dir/
  ├── en/
@@ -267,56 +208,37 @@ python clamp3_score.py <query_dir> <ref_dir> [--pairwise]
  │ ├── sample2.txt
  ```

- - Files with the **same prefix** (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`) are treated as pairs.
- - Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `query_dir/zh/sample1.txt`).

- **Example Output (Pairwise Mode)**:
- ```
- Total query features: 1000
- Total reference features: 1000
- Avg. pairwise similarity: 0.1639
- ```

- In **pairwise mode**, the script will additionally output a JSON Lines file (`inference/pairwise_similarities.jsonl`) with the similarity scores for each query-reference pair.
- For example:
- ```json
- {"query": "txt_features/UzUybLGvBxE.npy", "reference": "mid_features/UzUybLGvBxE.npy", "similarity": 0.2289600819349289}
- ```

- > **Note**: The file paths in the output will retain the folder structure and file names, but the top-level folder names and file extensions will be replaced.

- #### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Evaluating Retrieval Performance**

- This script evaluates **CLaMP3's retrieval performance on a paired dataset**, measuring how accurately the system ranks the correct reference files for each query using metrics like **MRR** and **Hit@K**.

  ```bash
- python clamp3_eval.py <query_dir> <ref_dir>
  ```

- - **Matching Folder Structure & Filenames**:
- Requires paired query and reference files, with identical folder structure and filenames between `query_dir` and `ref_dir`. This matches the requirements of **pairwise mode** in `clamp3_score.py`.

- - **Evaluation Metrics**:
- The script calculates the following retrieval metrics:
- - **MRR (Mean Reciprocal Rank)**
- - **Hit@1**, **Hit@10**, and **Hit@100**

- **Example Output**:
- ```
- Total query features: 1000
- Total reference features: 1000
- MRR: 0.3301
- Hit@1: 0.251
- Hit@10: 0.482
- Hit@100: 0.796
  ```

- - **Additional Output**:
- A JSON Lines file (`inference/retrieval_ranks.jsonl`) with query-reference ranks:
- ```json
- {"query": "txt_features/HQ9FaXu55l0.npy", "reference": "xml_features/HQ9FaXu55l0.npy", "rank": 6}
- ```
-
  ## **Repository Structure**
  - **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
  - **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.
 
  </p>

  ## **Overview**
+ CLaMP 3 is a **state-of-the-art** framework for **music information retrieval (MIR)** across multiple **modalities** (✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**) and **languages** (🌐 27 trained, 100 supported). It leverages **contrastive learning** to align diverse music formats into a **shared representation space**, enabling seamless cross-modal retrieval. You can think of it as a more comprehensive version of CLAP or MuLan—with stronger performance, support for all major music modalities, and global language coverage.

+ 🚀 **Why CLaMP 3?**
+ **Multimodal**: Works with ✍️ **text**, 🎼 **sheet music**, 🎵 **audio**, 🎹 **MIDI**, and 🖼️ **images**
+ **Multilingual**: Supports **27 trained** & generalizes to **100 languages**
+ **SOTA Performance**: Significantly outperforms previous strong baselines across modalities and languages

+ ## **Key Features**

+ ### **Multimodal Support**
+ - **Sheet Music**: Interleaved ABC notation (**512 bars**)
+ - **Performance Signals**: MIDI Text Format (**512 MIDI messages**)
+ - **Audio Recordings**: [MERT](https://arxiv.org/abs/2306.00107) features (**640 sec of audio**)

+ ### **Multilingual Capabilities**
+ - Trained on **27 languages**, generalizes to **100 languages** using [XLM-R](https://arxiv.org/abs/1911.02116)

+ ### **Visual Semantic Understanding**
+ - Learns visual semantics (e.g., image captions) for tasks like **image-to-music retrieval**

+ ### **Datasets & Benchmarks**
+ - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag)**: **2.31M music-text pairs** 🌎
+ - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x)**: **1,000 music triplets**

+ ## 🔥 **What Can CLaMP 3 Do?**
+
+ 💡 **Text-to-Music Retrieval**: Search music with text (100 languages!)
+ 📸 **Image-to-Music Retrieval**: Match music to images 🎨
+ 🔄 **Cross-Modal Retrieval**: Find related music across formats
+ 🛠️ **Zero-Shot Classification**: Identify genre, mood, & style 🏷️
+ 🎼 **Semantic Similarity**: Measure similarity between generated & reference music
+
+ 👉 **Check it out**: [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/)

  ## **Quick Start Guide**
+ For users who want to get started quickly with CLaMP3, follow these steps:
+
+ ### **Install the Environment**
+ Run the following commands:

  ```bash
  conda create -n clamp3 python=3.10.16 -y
  conda activate clamp3

  ```
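
The install commands assume you are working inside a local clone of the repository, since dependencies are installed with `pip install -r requirements.txt`. A minimal sketch, using the repository URL from the script links in this README:

```bash
# Hypothetical setup step: fetch the code before creating the conda environment.
git clone https://github.com/sanderwood/clamp3.git
cd clamp3
```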

  ### **Overview of `clamp3_*.py` Scripts**
+ CLaMP 3 provides scripts for **semantic similarity calculation**, **semantic search**, and **retrieval performance evaluation** across five modalities. Simply provide the file path, and the script will automatically detect the modality and extract the relevant features.
+
+ Supported formats include:
+ - **Audio**: `.mp3`, `.wav`
+ - **Performance Signals**: `.mid`, `.midi`
+ - **Sheet Music**: `.mxl`, `.musicxml`, `.xml`
+ - **Images**: `.png`, `.jpg`
+ - **Text**: `.txt` (in 100 languages)
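
Because the modality is detected automatically from the file you pass, the same command shape covers every query type. A small sketch using the documented `clamp3_search.py` usage (file and folder names below are placeholders):

```bash
# Same script, different query modalities - the script detects each from the file provided.
python clamp3_search.py query.txt ref_music/   # text query (text-to-music)
python clamp3_search.py query.jpg ref_music/   # image query (image-to-music)
python clamp3_search.py query.mid ref_music/   # MIDI query (music-to-music)
```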

+ #### **Feature Management**
+ - Extracted features are stored in the `cache/` directory and reused in future runs to avoid recomputation.
+ - Temporary files are saved in `temp/` and cleaned up after each run.
+
+ > **Note**: All files in a folder must belong to the same modality for processing.
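
If you swap in a new dataset and want the features recomputed, clearing the cache is the simplest option; a sketch, assuming all cached features live under `cache/` as described above:

```bash
# Remove previously extracted features so the next run re-extracts them.
rm -r cache/
```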

  #### **[`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_score.py) - Semantic Similarity Calculation**

+ This script calculates semantic similarity between query and reference files. By default, it uses **pairwise mode**, but you can switch to **group mode** using the `--group` flag.

  ```bash
+ python clamp3_score.py <query_dir> <ref_dir> [--group]
  ```

+ - **Pairwise Mode (default)**:
+ Compares files with **matching prefixes** and **identical folder structures**.
+
+ **Folder structure example**:
  ```
  query_dir/
  ├── en/

  │ ├── sample2.txt
  ```

+ - Files with the **same prefix** (before the first dot) are treated as pairs (e.g., `query_dir/en/sample1.wav` and `ref_dir/en/sample1.txt`).
+ - Multiple query files (e.g., `query_dir/zh/sample1.1.wav`, `query_dir/zh/sample1.2.wav`) can correspond to one reference file (e.g., `ref_dir/zh/sample1.txt`).
+
+ **Important**:
+ - **Pairwise mode** can be **slow** for large datasets.
+ - If you have a large dataset, **switch to group mode** for faster computation.
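
For example, to score a text-to-music system in the default pairwise mode, the generated audio and the corresponding prompts only need matching prefixes and mirrored folders (directory names below are placeholders):

```bash
# Pairwise mode is the default: each query file is scored against the
# reference file that shares its prefix and relative path.
python clamp3_score.py generated_audio/ prompts/
```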

+ - **Group Mode**:
+ Compares **all query files** to **all reference files** and calculates the average similarity.
+
+ **Enable Group Mode**:
+ ```bash
+ python clamp3_score.py query_dir ref_dir --group
+ ```

+ #### **[`clamp3_search.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_search.py) - Semantic Search**

+ Run retrieval tasks by comparing a query file to reference files in `ref_dir`. The query and `ref_dir` can be **any modality**, so there are **25 possible retrieval combinations**, e.g., text-to-music, image-to-text, music-to-music, music-to-text (zero-shot music classification), etc.

  ```bash
+ python clamp3_search.py <query_file> <ref_dir> [--top_k TOP_K]
  ```
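
The optional `--top_k` argument caps the number of returned matches; a quick example with placeholder names, asking for the three best results:

```bash
# Return the three most similar reference items for a text query.
python clamp3_search.py query.txt ref_music/ --top_k 3
```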

+ #### **[`clamp3_eval.py`](https://github.com/sanderwood/clamp3/blob/main/clamp3_eval.py) - Retrieval Performance Evaluation**

+ Evaluates **CLaMP3's retrieval performance** on a paired dataset using metrics like **MRR** and **Hit@K**. Works the same way as **pairwise mode** in `clamp3_score.py`—requiring **matching folder structure** and **filenames** between `query_dir` and `ref_dir`.

+ ```bash
+ python clamp3_eval.py <query_dir> <ref_dir>
  ```
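
The paired layout is the same one pairwise scoring expects; a sketch with placeholder directory names:

```bash
# query_dir and ref_dir must mirror each other's folder structure and filenames
# (e.g., query_dir/en/sample1.wav pairs with ref_dir/en/sample1.txt).
python clamp3_eval.py query_dir ref_dir
```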

  ## **Repository Structure**
  - **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
  - **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.