---
license: mit
---

# 🔊 [ICLR 2025] SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

[![Conference Paper](https://img.shields.io/badge/ICLR-2025-blue)](https://openreview.net/forum?id=odU59TxdiB)

🚀 **SSLAM** is a self-supervised learning framework designed to enhance audio representation quality for both **polyphonic** (multiple overlapping sounds) and monophonic soundscapes. Unlike traditional SSL models that focus on monophonic data, SSLAM introduces a novel **source retention loss** and **audio mixture training**, significantly improving performance on real-world polyphonic audio.

🔗 **[GitHub](https://github.com/ta012/SSLAM)** | **[Paper](https://openreview.net/pdf?id=odU59TxdiB)** | **[OpenReview](https://openreview.net/forum?id=odU59TxdiB)** | **[Pre-trained Models](https://drive.google.com/drive/folders/1G0icv-hdqDEqnfP4EFszMXhFnWWM09gT?usp=sharing)**

---

# 📋 Table of Contents
- [Why SSLAM?](#why-sslam)
- [Key Features](#key-features)
- [Results](#results)
- [Inference Mode](#️inference-mode)
- [Inference Installation](#inference-installation)
- [Model Weights](#model-weights)
- [Using SSLAM](#using-sslam)
- [Training Mode](#training-mode)
- [Training Installation](#training-installation)
- [Data Preparation](#️data-preparation)
- [Pre-Training](#pre-training)
- [Checklist](#checklist)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)

---

## 🔍Why SSLAM?
🔊 **Real-world audio is polyphonic**: multiple overlapping sound sources are common in everyday environments.
❌ **Existing SSL models focus on monophonic audio,** limiting their ability to generalize to real-world scenarios. The benchmarks they are evaluated on are primarily monophonic, and their pre-training does not account for polyphonic environments.
💡 **SSLAM bridges this gap** by introducing **self-supervised learning from audio mixtures**, enabling robust learning across **both monophonic and polyphonic soundscapes**.

---

## 🎼Key Features
✅ **Self-Supervised Learning from Audio Mixtures (SSLAM)** – improves robustness to real-world polyphonic audio (multiple overlapping sounds).
✅ **Source Retention Loss** – preserves the integrity of each sound source even in complex mixtures.
✅ **SOTA Performance** – achieves a **+3.9% mAP improvement** on AudioSet-2M and **+9.1% on polyphonic datasets**.

---
## 📊Results

### 1. Standard Audio-SSL Benchmark Datasets
![Standard Audio-SSL Benchmark](assets/as2m_results.png)

### 2. Polyphonic Datasets
![Polyphonic Datasets](assets/poly_results.png)

---

## **🔍️Inference Mode**
> **Note**: If you are already using [EAT](https://github.com/cwx-worst-one/EAT/tree/main) in your evaluation/inference pipeline, you can simply swap in the SSLAM weights, as the inference and evaluation code is identical to EAT's.

If not, follow the steps below for installation:

## 📥Inference Installation

```bash
conda create --prefix /path/to/sslam_eval_env -y python=3.9.13
/path/to/sslam_eval_env/bin/python -m pip install pip==24.0 # downgrade pip
/path/to/sslam_eval_env/bin/pip install -r SSLAM_Inference/requirements_sslam_eval.txt
```
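
A quick way to confirm the environment is usable before running any of the scripts (assuming the requirements file installs PyTorch, which the SSLAM/EAT code depends on):

```bash
# Optional sanity check: the import should succeed and print the installed version.
/path/to/sslam_eval_env/bin/python -c "import torch; print(torch.__version__)"
```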

---

## 📦Model Weights

| Model Type | Link |
|--------------------------|--------------------------------------------------------------------------------------------|
| **Pre-Trained** | [Download](https://drive.google.com/drive/folders/1aA65-qQCHSCrkiDeLGUtn1PiEjJi5HS8?usp=sharing) |
| **AS2M Fine-Tuned** (50.2 mAP) | [Download](https://drive.google.com/drive/folders/1Yy38IyksON5RJFNM7gzeQoAOSPnEIKp2?usp=sharing) |
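
The links above point to Google Drive folders. If you prefer the command line, one option (not part of the repository's requirements, just a suggestion) is `gdown`:

```bash
# Hypothetical download helper; by default gdown saves the folder into the current directory.
/path/to/sslam_eval_env/bin/pip install gdown
gdown --folder "https://drive.google.com/drive/folders/1Yy38IyksON5RJFNM7gzeQoAOSPnEIKp2?usp=sharing"
# Then point the checkpoint paths used by the scripts in SSLAM_Inference/scripts at the downloaded .pt files.
```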

---

#### 🚀**Using SSLAM**

We provide scripts to use SSLAM in the following ways:

##### 1. **Audio Feature (Representation) Extraction Using SSLAM Encoder**

```bash
cd SSLAM_Inference/scripts
bash feature_extract.sh
```

##### 2. **Inference on Single Audio WAV File**

```bash
cd SSLAM_Inference/scripts
bash inference.sh
```

##### 3. **Evaluation on AudioSet-2M Evaluation Set**

```bash
cd SSLAM_Inference/scripts
bash evaluate_AS2M_finetuned.sh # Reported mAP: 50.2
```

---
## **📈Training Mode**
This section covers self-supervised pre-training, fine-tuning, and linear evaluation.

#### **📥Training Installation**

For training, it is better to install fairseq in editable mode:

```bash
conda create --prefix /path/to/sslam_env -y python=3.9.13 ## env used for training
/path/to/sslam_env/bin/python -m pip install pip==24.0 # downgrade pip
cd SSLAM/
git clone https://github.com/facebookresearch/fairseq.git

## IMPORTANT: Copy the Pre-Training/SSLAM_Stage2 directory to SSLAM/fairseq
## so that the resultant path is SSLAM/fairseq/SSLAM_Stage2/.
cd fairseq/

## install all requirements apart from fairseq
/path/to/sslam_env/bin/pip install -r SSLAM_Stage2/requirements_sslam.txt
## install fairseq in editable mode
/path/to/sslam_env/bin/pip install --editable ./
```
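
If the editable install succeeded, fairseq should be importable from the training environment; a quick check:

```bash
# Should print the installed fairseq version without raising ImportError.
/path/to/sslam_env/bin/python -c "import fairseq; print(fairseq.__version__)"
```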

#### 🗄️Data Preparation
We utilised AudioSet-2M (full set) for pre-training. For this phase, only the `train.tsv` file is required. Use the [train.tsv for AudioSet-20K](data_manifests/manifest_as20k/train.tsv) as a reference when preparing the `train.tsv` for your downloaded copy of AudioSet-2M.
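
For illustration only: EAT/data2vec-style manifests typically follow the wav2vec layout, with the audio root directory on the first line and one tab-separated `relative_path<TAB>num_samples` entry per clip. Verify the exact columns against the linked AudioSet-20K manifest; the paths below are placeholders, and `soxi` (from SoX) is just one way to obtain sample counts.

```bash
# Hypothetical sketch of building train.tsv for a local AudioSet-2M copy.
# Check the layout against data_manifests/manifest_as20k/train.tsv before using it.
AUDIO_ROOT=/path/to/audioset_2m/train_wavs

{
  echo "${AUDIO_ROOT}"
  for f in "${AUDIO_ROOT}"/*.wav; do
    # soxi -s prints the number of samples in the file (requires sox).
    printf '%s\t%s\n' "$(basename "$f")" "$(soxi -s "$f")"
  done
} > train.tsv
```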

#### 🚀Pre-Training

**Note:** This repository focuses solely on Stage 2 pre-training, which introduces our novel SSLAM pre-training strategy.

To begin Stage 2, you'll need a Stage 1 checkpoint. In our complete pre-training process, Stage 1 mirrors the approach in [EAT](https://github.com/cwx-worst-one/EAT/tree/main) and achieves similar performance. For convenience, we use the EAT checkpoint as the Stage 1 checkpoint.

Download the epoch 10 checkpoint from the [EAT](https://github.com/cwx-worst-one/EAT/tree/main) repository: [EAT-base_epoch10_pt.pt](https://drive.google.com/file/d/10pklbY_fKraQUIBizSg1kv4lJXNWxpxl/view?usp=sharing).
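
Once downloaded, a quick way to guard against a corrupted file is to check that the checkpoint deserializes; fairseq checkpoints are ordinary `torch.load`-able files, though the exact key names are not guaranteed:

```bash
# Should print the top-level keys of the checkpoint dict (typically 'model', 'cfg', ...).
/path/to/sslam_env/bin/python -c "import torch; ckpt = torch.load('EAT-base_epoch10_pt.pt', map_location='cpu'); print(list(ckpt.keys()))"
```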

*Only the contents of the **models/** folder and a few parameters in the pre-training script differ between Stage 1 and Stage 2.*

```bash
cd SSLAM/fairseq/SSLAM_Stage2/scripts/
bash pretrain_stage2.sh
```
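
Before launching, check where `pretrain_stage2.sh` expects the Stage 1 (EAT epoch 10) checkpoint and the AudioSet-2M `train.tsv`, and edit those paths for your setup; the exact variable names are defined inside the script.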

## 📌Checklist
- [x] Inference Mode
- [x] Pre-Training

---

## 🙏Acknowledgements

Our code is primarily based on [EAT](https://github.com/cwx-worst-one/EAT/tree/main) and [data2vec 2.0](https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec), with additional concepts and components adapted from [AudioMAE](https://github.com/facebookresearch/AudioMAE).

## 📜Citation

If you find our work useful, please cite it as:

```bibtex
@inproceedings{alex2025sslam,
  title={{SSLAM}: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes},
  author={Tony Alex and Sara Atito and Armin Mustafa and Muhammad Awais and Philip J B Jackson},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=odU59TxdiB}
}
```