# MusicGen-Style: Audio Conditioning for Music Generation via Discrete Bottleneck Features

AudioCraft provides the code and models for MusicGen-Style, [Audio Conditioning for Music Generation via Discrete Bottleneck Features][arxiv].

MusicGen-Style is a text-and-audio-to-music model that can be conditioned on textual and audio data, thanks to a style conditioner.
The style conditioner takes as input a short music excerpt (between 1.5 and 4.5 seconds) and extracts features that the model uses to generate music in the same style.
This style conditioning can be combined with a textual description.

Check out our [sample page][musicgen_style_samples] or try one of the available demos!

We use 16K hours of licensed music to train MusicGen-Style. Specifically, we rely on an internal dataset
of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.


## Model Card

See [the model card](../model_cards/MUSICGEN_STYLE_MODEL_CARD.md).


## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

MusicGen-Style requires a GPU with at least 16 GB of memory to run inference with the medium-sized model (~1.5B parameters).

## Usage

1. You can play with MusicGen-Style by running the Jupyter notebook at [`demos/musicgen_style_demo.ipynb`](../demos/musicgen_style_demo.ipynb) locally (if you have a GPU).
2. You can run the Gradio demo locally with `python -m demos.musicgen_style_app --share`.

## API

We provide a simple API and one pre-trained model, which uses MERT as the feature extractor for the style conditioner:
- `facebook/musicgen-style`: medium (1.5B) MusicGen model, text and style to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/musicgen-style)

In order to use MusicGen-Style locally **you must have a GPU**. We recommend 16GB of memory. 

See below for a quick example of how to use the API.

To perform text-to-music:
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-style')

model.set_generation_params(
    duration=8,  # generate 8 seconds, can go up to 30
    use_sampling=True,
    top_k=250,
    cfg_coef=3.,  # Classifier Free Guidance coefficient
    cfg_coef_beta=None,  # double CFG is only useful for text-and-style conditioning
)

descriptions = ['disco beat', 'energetic EDM', 'funky groove']

wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
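
If you are running this in a notebook, you can also listen to the generated samples directly. Below is a minimal sketch, assuming an IPython/Jupyter environment and reusing the `wav` tensor and `model` from the example above:

```python
from IPython.display import Audio

# Play the first generated sample inline; Audio accepts a (channels, samples) array.
Audio(wav[0].cpu().numpy(), rate=model.sample_rate)
```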

To perform style-to-music:
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-style')

model.set_generation_params(
    duration=8,  # generate 8 seconds, can go up to 30
    use_sampling=True,
    top_k=250,
    cfg_coef=3.,  # Classifier Free Guidance coefficient
    cfg_coef_beta=None,  # double CFG is only useful for text-and-style conditioning
)

model.set_style_conditioner_params(
    eval_q=1,  # integer between 1 and 6
               # eval_q is the level of quantization that passes
               # through the conditioner. When low, the model adheres
               # less to the audio conditioning.
    excerpt_length=3.,  # the length in seconds of the excerpt used by the model
)

melody, sr = torchaudio.load('./assets/electronic.mp3')

wav = model.generate_with_chroma(descriptions=[None, None, None],
                                 melody_wavs=melody[None].expand(3, -1, -1),
                                 melody_sample_rate=sr)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
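
Since `eval_q` controls how strongly the output follows the audio conditioning, it can be useful to compare a few settings on the same excerpt. The loop below is only an illustrative sketch (the values and output names are arbitrary) and reuses `model`, `melody` and `sr` from the example above:

```python
# Hypothetical sweep: lower eval_q leaves the model more freedom,
# higher eval_q makes the output adhere more closely to the style excerpt.
for q in (1, 3, 6):
    model.set_style_conditioner_params(eval_q=q, excerpt_length=3.)
    wav = model.generate_with_chroma(descriptions=[None],
                                     melody_wavs=melody[None],
                                     melody_sample_rate=sr)
    audio_write(f'style_eval_q_{q}', wav[0].cpu(), model.sample_rate,
                strategy="loudness", loudness_compressor=True)
```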

To perform style-and-text-to-music:
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-style')

model.set_generation_params(
    duration=8,  # generate 8 seconds, can go up to 30
    use_sampling=True,
    top_k=250,
    cfg_coef=3.,  # Classifier Free Guidance coefficient
    cfg_coef_beta=5.,  # double CFG is necessary for text-and-style conditioning.
                       # Beta in the double CFG formula, between 1 and 9. When set to 1
                       # it is equivalent to normal CFG; increasing it pushes the text
                       # condition. See the bottom of https://musicgenstyle.github.io/
                       # to better understand the effects of the double CFG coefficients.
)

model.set_style_conditioner_params(
    eval_q=1,  # integer between 1 and 6
               # eval_q is the level of quantization that passes
               # through the conditioner. When low, the model adheres
               # less to the audio conditioning.
    excerpt_length=3.,  # the length in seconds of the excerpt used by the model; it can be
                        # between 1.5 and 4.5 seconds but must be shorter than the provided conditioning audio
)

melody, sr = torchaudio.load('./assets/electronic.mp3')

descriptions = ["8-bit old video game music", "Chill lofi remix", "80s New wave with synthesizer"]

wav = model.generate_with_chroma(descriptions=descriptions,
                                 melody_wavs=melody[None].expand(3, -1, -1),
                                 melody_sample_rate=sr)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
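
The balance between the text prompt and the style excerpt is controlled by `cfg_coef_beta`. To hear the trade-off, a small sweep such as the sketch below can help; the values and output names are arbitrary, and `model`, `melody` and `sr` come from the example above:

```python
# Hypothetical sweep: higher beta pushes the text condition, lower beta
# (down to 1, i.e. standard CFG) lets the style excerpt dominate more.
for beta in (2., 5., 8.):
    model.set_generation_params(duration=8, use_sampling=True, top_k=250,
                                cfg_coef=3., cfg_coef_beta=beta)
    wav = model.generate_with_chroma(descriptions=["8-bit old video game music"],
                                     melody_wavs=melody[None],
                                     melody_sample_rate=sr)
    audio_write(f'beta_{beta}', wav[0].cpu(), model.sample_rate,
                strategy="loudness", loudness_compressor=True)
```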


## Training
To train MusicGen-Style, we use the [MusicGenSolver](../audiocraft/solvers/musicgen.py).

Note that **we do NOT provide any of the datasets** used for training MusicGen-Style.
We provide a dummy dataset containing just a few examples for illustrative purposes.

Please first read the [TRAINING documentation](./TRAINING.md), in particular the Environment Setup section.


### Example configurations and grids

We provide the configuration to reproduce the training of MusicGen-Style in [config/solver/musicgen/musicgen_style_32khz.yaml](../config/solver/musicgen/musicgen_style_32khz.yaml).
In particular, the conditioner configuration is provided in [config/conditioner/style2music.yaml](../config/conditioner/style2music.yaml).

The grid to train the model is
[audiocraft/grids/musicgen/musicgen_style_32khz.py](../audiocraft/grids/musicgen/musicgen_style_32khz.py).

```shell
# text-and-style-to-music
dora grid musicgen.musicgen_style_32khz --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is set up.
```

### Dataset and metadata
Learn more in the [datasets section](./DATASETS.md).

### Audio tokenizers

See the [MusicGen documentation](./MUSICGEN.md).

### Fine tuning existing models

You can initialize your model from one of the pretrained models by using the `continue_from` argument, in particular:

```bash
# Using the pretrained MusicGen-Style model.
dora run solver=musicgen/musicgen_style_32khz model/lm/model_scale=medium continue_from=//pretrained/facebook/musicgen-style conditioner=style2music

# Using another model you already trained with a Dora signature SIG.
dora run solver=musicgen/musicgen_style_32khz model/lm/model_scale=medium continue_from=//sig/SIG conditioner=style2music

# Or providing a path manually.
dora run solver=musicgen/musicgen_style_32khz model/lm/model_scale=medium continue_from=/checkpoints/my_other_xp/checkpoint.th
```

**Warning:** You are responsible for selecting the other parameters accordingly, in a way that makes them compatible
with the model you are fine tuning. Configuration is NOT automatically inherited from the model you continue from. In particular, make sure to select the proper `conditioner` and `model/lm/model_scale`.


**Warning:** We currently do not support fine tuning a model with slightly different layers. If you decide
to change some parts, like the conditioning or some other parts of the model, you are responsible for manually crafting a checkpoint file from which we can safely run `load_state_dict`.
If you do so, make sure your checkpoint is saved with `torch.save` and contains a dict
`{'best_state': {'model': model_state_dict_here}}`. Directly give the path to `continue_from` without a `//pretrained/` prefix.
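
As a minimal sketch of that checkpoint layout (all paths below are hypothetical), wrapping an existing state dict could look like:

```python
import torch

# Hypothetical path: load the state dict you crafted for your modified model...
state_dict = torch.load('/checkpoints/my_other_xp/crafted_model_state.th')
# ...and save it in the layout expected when passing a raw path to `continue_from`.
torch.save({'best_state': {'model': state_dict}}, '/checkpoints/my_other_xp/checkpoint.th')
```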



[arxiv]: https://arxiv.org/abs/2407.12563
[musicgen_style_samples]: https://musicgenstyle.github.io/