# MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer

AudioCraft provides the code and models for MAGNeT, [Masked Audio Generation using a Single Non-Autoregressive Transformer][arxiv].

MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions.
It is a masked generative non-autoregressive Transformer trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. 
Unlike prior work on masked generative audio Transformers, such as [SoundStorm](https://arxiv.org/abs/2305.09636) and [VampNet](https://arxiv.org/abs/2307.04686), 
MAGNeT doesn't require semantic token conditioning, model cascading, or audio prompting; it performs full text-to-audio generation with a single non-autoregressive Transformer.
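
As a quick sanity check on these numbers: at 50 Hz per codebook, a 10-second sample corresponds to a grid of 4 × 500 tokens. The sketch below is just this arithmetic, not AudioCraft code.

```python
# Toy arithmetic for the token grid MAGNeT predicts (not AudioCraft code).
frame_rate = 50      # EnCodec tokens per second, per codebook
num_codebooks = 4
duration_s = 10

steps = frame_rate * duration_s          # 500 timesteps
total = num_codebooks * steps            # 2000 tokens for a 10-second sample
print(f"{num_codebooks} x {steps} = {total} tokens")
```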

Check out our [sample page][magnet_samples] or test the available demo!

We use 16K hours of licensed music to train MAGNeT. Specifically, we rely on an internal dataset
of 10K high-quality music tracks, and on ShutterStock and Pond5 music data.


## Model Card

See [the model card](../model_cards/MAGNET_MODEL_CARD.md).


## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

AudioCraft requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters).

## Usage

We currently offer two ways to interact with MAGNeT:
1. You can use the Gradio demo locally by running [`python -m demos.magnet_app --share`](../demos/magnet_app.py).
2. You can play with MAGNeT by running the Jupyter notebook at [`demos/magnet_demo.ipynb`](../demos/magnet_demo.ipynb) locally (if you have a GPU).

## API

We provide a simple API and 6 pre-trained models. The pre-trained models are:
- `facebook/magnet-small-10secs`: 300M model, text to music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-10secs)
- `facebook/magnet-medium-10secs`: 1.5B model, text to music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-10secs)
- `facebook/magnet-small-30secs`: 300M model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-30secs)
- `facebook/magnet-medium-30secs`: 1.5B model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-30secs)
- `facebook/audio-magnet-small`: 300M model, text to sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-small)
- `facebook/audio-magnet-medium`: 1.5B model, text to sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-medium)

In order to use MAGNeT locally **you must have a GPU**. We recommend 16 GB of memory, especially for
the medium-sized models.

Below is a quick example of using the API.

```python
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

model = MAGNeT.get_pretrained('facebook/magnet-small-10secs')
descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
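
Generation can also be tuned before calling `generate`. The following is a hedged sketch based on the parameters exposed in the demo notebook; treat the exact keyword arguments as assumptions and check [`demos/magnet_demo.ipynb`](../demos/magnet_demo.ipynb) for the authoritative list.

```python
# Hedged sketch: generation parameters as exposed in demos/magnet_demo.ipynb.
# The exact argument names and defaults may differ across versions.
model.set_generation_params(
    use_sampling=True,
    top_k=0,                          # 0 disables top-k filtering
    top_p=0.9,                        # nucleus sampling threshold
    temperature=3.0,
    max_cfg_coef=10.0,                # classifier-free guidance coefficients
    min_cfg_coef=1.0,
    decoding_steps=[20, 10, 10, 10],  # iterative decoding steps, one entry per codebook level
    span_arrangement='nonoverlap',
)
wav = model.generate(descriptions)
```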

## 🤗 Transformers Usage

Coming soon...

## Training

The [MagnetSolver](../audiocraft/solvers/magnet.py) implements MAGNeT's training pipeline.
It defines a masked generation task over multiple streams of discrete tokens
extracted from a pre-trained EnCodec model (see the [EnCodec documentation](./ENCODEC.md)
for more details on how to train such a model).
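
To make the task concrete, here is a toy sketch of span masking over a batch of token streams. This is an illustration of the objective only, not the actual `MagnetSolver` code; the tensor shapes and the ~60 ms span length follow the paper's setup and should be treated as assumptions here.

```python
# Toy illustration of MAGNeT's masked-generation objective (NOT MagnetSolver code).
import torch

B, K, T = 2, 4, 500   # batch, codebooks, timesteps (50 Hz x 10 s)
span_len = 3          # spans of 3 tokens ~ 60 ms at 50 Hz (assumption from the paper)
tokens = torch.randint(0, 1024, (B, K, T))

# Mask random spans along the time axis; the model is trained to predict
# the tokens at the masked positions, one codebook level at a time.
mask = torch.zeros(B, T, dtype=torch.bool)
starts = torch.randint(0, T - span_len, (B, 40))
for b in range(B):
    for s in starts[b]:
        mask[b, s:s + span_len] = True
```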

Note that **we do NOT provide any of the datasets** used for training MAGNeT.
We provide a dummy dataset containing just a few examples for illustrative purposes.

Please read first the [TRAINING documentation](./TRAINING.md), in particular the Environment Setup section.


### Example configurations and grids

We provide configurations to reproduce the released models and our research.
MAGNeT solver configurations are available in [config/solver/magnet](../config/solver/magnet),
in particular:
* MAGNeT model for text-to-music:
[`solver=magnet/magnet_32khz`](../config/solver/magnet/magnet_32khz.yaml)
* MAGNeT model for text-to-sound:
[`solver=magnet/audio_magnet_16khz`](../config/solver/magnet/audio_magnet_16khz.yaml)

We provide 3 different model scales: `model/lm/model_scale=small` (300M), `medium` (1.5B), and `large` (3.3B).

Please find some example grids to train MAGNeT at
[audiocraft/grids/magnet](../audiocraft/grids/magnet/).

```shell
# text-to-music
dora grid magnet.magnet_32khz --dry_run --init

# text-to-sound
dora grid magnet.audio_magnet_16khz --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is setup.
```

### Dataset and metadata
Learn more in the [datasets section](./DATASETS.md).

#### Music Models
MAGNeT's underlying dataset is an AudioDataset augmented with music-specific metadata.
The MAGNeT dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files. 

#### Sound Models
Audio-MAGNeT's underlying dataset is an AudioDataset augmented with description metadata.
The Audio-MAGNeT dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files, or through a specified external folder.
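
As a hedged illustration of the expected layout, the snippet below writes a sidecar `.json` next to an audio file; the field names here are examples only, and the actual schema is described in the [datasets section](./DATASETS.md).

```python
# Hedged example of a sidecar metadata file; field names are illustrative,
# see ./DATASETS.md for the actual schema.
import json

meta = {
    "title": "My track",
    "description": "energetic EDM with a funky groove",  # used for text conditioning
}
with open("dataset/music/track_01.json", "w") as f:  # next to dataset/music/track_01.wav
    json.dump(meta, f)
```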

### Audio tokenizers

See the [MusicGen documentation](./MUSICGEN.md).

### Fine-tuning existing models

You can initialize your model to one of the pretrained models by using the `continue_from` argument, in particular:

```bash
# Using pretrained MAGNeT model.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//pretrained/facebook/magnet-medium-10secs conditioner=text2music

# Using another model you already trained with a Dora signature SIG.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//sig/SIG conditioner=text2music

# Or providing manually a path
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=/checkpoints/my_other_xp/checkpoint.th
```

**Warning:** You are responsible for selecting the other parameters accordingly, in a way that makes them compatible
with the model you are fine-tuning. Configuration is NOT automatically inherited from the model you continue from. In particular, make sure to select the proper `conditioner` and `model/lm/model_scale`.

**Warning:** We currently do not support fine-tuning a model with slightly different layers. If you decide
to change some parts, like the conditioning or some other parts of the model, you are responsible for manually crafting a checkpoint file from which we can safely run `load_state_dict`.
If you decide to do so, make sure your checkpoint is saved with `torch.save` and contains a dict
`{'best_state': {'model': model_state_dict_here}}`. Directly give the path to `continue_from` without a `//pretrained/` prefix.
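
For reference, a minimal sketch of crafting such a checkpoint, assuming `model` is your already-instantiated module with compatible layers:

```python
import torch

# Save the state dict in the layout expected by `continue_from`
# (assuming `model` is a compatible, already-instantiated module).
torch.save({'best_state': {'model': model.state_dict()}},
           '/checkpoints/my_finetune_init.th')
```

You would then pass `/checkpoints/my_finetune_init.th` directly to `continue_from`.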

### Evaluation stage
For the 6 pretrained MAGNeT models, the objective metrics can be reproduced using the following grids:

```shell
# text-to-music
REGEN=1 dora grid magnet.magnet_pretrained_32khz_eval --dry_run --init

# text-to-sound
REGEN=1 dora grid magnet.audio_magnet_pretrained_16khz_eval --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is setup.
```

See [MusicGen](./MUSICGEN.md) for more details. 

### Generation stage

See the [MusicGen documentation](./MUSICGEN.md).

### Playing with the model

Once you have launched some experiments, you can easily access the Solver
with the latest trained model using the following snippet.

```python
from audiocraft.solvers.magnet import MagnetSolver

solver = MagnetSolver.get_eval_solver_from_sig('SIG', device='cpu', batch_size=8)
solver.model        # the trained model
solver.dataloaders  # the dataloaders used by the solver
```

### Importing / Exporting models

We currently do not support loading a model from the Hugging Face implementation, or exporting to it.
If you want to export your model in a way that is compatible with the `audiocraft.models.MAGNeT`
API, you can run:

```python
from audiocraft.utils import export
from audiocraft import train
xp = train.main.get_xp_from_sig('SIG_OF_LM')
export.export_lm(xp.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/state_dict.bin')
# You also need to bundle the EnCodec model you used!
## Case 1) you trained your own
xp_encodec = train.main.get_xp_from_sig('SIG_OF_ENCODEC')
export.export_encodec(xp_encodec.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/compression_state_dict.bin')
## Case 2) you used a pretrained model. Give the name you used without the //pretrained/ prefix.
## This will not dump the actual model, only a pointer to the right model to download.
export.export_pretrained_compression_model('facebook/encodec_32khz', '/checkpoints/my_audio_lm/compression_state_dict.bin')
```

Now you can load your custom model with:
```python
import audiocraft.models
magnet = audiocraft.models.MAGNeT.get_pretrained('/checkpoints/my_audio_lm/')
```


### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).

## FAQ

#### What are top-k, top-p, temperature and classifier-free guidance?

Check out [@FurkanGozukara tutorial](https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Tutorials/AI-Music-Generation-Audiocraft-Tutorial.md#more-info-about-top-k-top-p-temperature-and-classifier-free-guidance-from-chatgpt).

#### Should I use FSDP or autocast?

The two are mutually exclusive (because FSDP does autocast on its own).
You can use autocast up to 1.5B (medium), if you have enough RAM on your GPU.
FSDP makes everything more complex but will free up some memory for the actual
activations by sharding the optimizer state.
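
As a hedged example, assuming the standard AudioCraft `autocast` and `fsdp.use` config flags (check your version's `config/config.yaml`):

```shell
# Medium (1.5B) scale: plain autocast is usually enough.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium autocast=true

# Larger scales: shard the optimizer state with FSDP and disable autocast.
dora run solver=magnet/magnet_32khz model/lm/model_scale=large autocast=false fsdp.use=true
```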

## Citation
```
@misc{ziv2024masked,
      title={Masked Audio Generation using a Single Non-Autoregressive Transformer}, 
      author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
      year={2024},
      eprint={2401.04577},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```

## License

See license information in the [model card](../model_cards/MAGNET_MODEL_CARD.md).

[arxiv]: https://arxiv.org/abs/2401.04577
[magnet_samples]: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/