# Musimple: Text2Music with DiT Made Simple

## Introduction

This repository provides a simple and clear implementation of a **Text-to-Music Generation** pipeline using a **DiT (Diffusion Transformer)** model. The codebase includes key components such as **model training**, **inference**, and **evaluation**. We use the **GTZAN dataset** as an example to demonstrate a minimal, working pipeline for text-conditioned music generation.

The repository is designed to be easy to use and customize, making it simple to reproduce our results on a single **NVIDIA RTX 4090 GPU**. Additionally, the code is structured to be flexible, allowing you to modify it for your own tasks and datasets.

We plan to continue maintaining and improving this repository with new features, model improvements, and extended documentation.

## Features

- **Text-to-Music Generation**: Generate music directly from text descriptions using a DiT model.
- **GTZAN Example**: A simple pipeline using the GTZAN dataset to demonstrate the workflow.
- **End-to-End Pipeline**: Includes model training, inference, and evaluation, with support for generating audio files.
- **Customizable**: Easy to modify and extend for different datasets or use cases.
- **Single GPU Training**: Optimized for training on a single RTX 4090 GPU but adaptable to different hardware setups.

## Requirements

Before using the code, ensure that the following dependencies are installed:

- Python >= 3.9
- CUDA (if available)
- Required Python libraries from `requirements.txt`

You can install the dependencies using:

```bash
conda create -n musimple python=3.9
conda activate musimple
pip install -r requirements.txt
```

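After installing, you can quickly confirm that the environment sees your GPU (a minimal check, assuming PyTorch is among the packages in `requirements.txt`):

```python
# Minimal environment sanity check (assumes requirements.txt installs PyTorch).
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```
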
## Data Preprocessing

To begin with, you will need to download the **GTZAN dataset**. Once downloaded, you can use the `gtzan_split.py` script located in the `tools` directory to split the dataset into training and testing sets. Run the following command:

```bash
python gtzan_split.py --root_dir /path/to/gtzan/genres --output_dir /path/to/output/directory
```

Next, convert the audio files into HDF5 format using the `gtzan2h5.py` script:

```bash
python gtzan2h5.py --root_dir /path/to/audio/files --output_h5_file /path/to/output.h5 --config_path bigvgan_v2_22khz_80band_256x/config.json --sr 22050
```

### Preprocessed Data

If this process seems cumbersome, don't worry! **We have already preprocessed the dataset**, and you can find it in the **musimple/dataset** directory. You can download and use this data directly to skip the preprocessing steps.

### Data Breakdown

This preprocessing stage has two main parts:

- **Text to Latent Transformation**: A Sentence Transformer converts text labels into latent representations.
- **Audio to Mel Spectrogram**: The original audio files are converted into mel spectrograms.

Both the latent representations and mel spectrograms are stored in an HDF5 file, making them easily accessible during training and inference.

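For intuition, here is a minimal sketch of what this stage does, with an assumed encoder name and generic librosa mel parameters (the actual `gtzan2h5.py` derives its mel settings from the BigVGAN config passed via `--config_path`):

```python
# Minimal sketch of the preprocessing stage (illustrative only; the real
# gtzan2h5.py reads its mel settings from the BigVGAN config file).
import os

import h5py
import librosa
from sentence_transformers import SentenceTransformer

# Assumed encoder: any Sentence Transformer yields a fixed-size embedding.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def preprocess(audio_path: str, label: str, h5_path: str, sr: int = 22050) -> None:
    # Text to latent: encode the genre label (e.g. "jazz") into a vector.
    text_latent = text_encoder.encode([label])[0]

    # Audio to mel spectrogram: 80 mel bands, matching input_size [80, 800].
    audio, _ = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)

    # Store both in one HDF5 file for fast access during training/inference.
    with h5py.File(h5_path, "a") as f:
        group = f.create_group(os.path.basename(audio_path))
        group.create_dataset("text_latent", data=text_latent)
        group.create_dataset("mel", data=mel)
```
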
## Training

To begin training, simply navigate to the `Musimple` directory and run the following command:

```bash
cd Musimple
python train.py
```

### Configurable Parameters

All training-related parameters can be adjusted in the configuration file located at:

```
./config/train.yaml
```

This allows you to easily modify aspects like the learning rate, batch size, number of epochs, and more to suit your hardware or dataset requirements.

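The config is plain YAML, so you can also load it programmatically when scripting experiments (a sketch; PyYAML is assumed, and the keys mirror the parameters listed below):

```python
# Load ./config/train.yaml programmatically (sketch; requires PyYAML).
import yaml

with open("./config/train.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["batch_size"], cfg["lr"], cfg["device"])
```
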
We also provide a **pre-trained checkpoint** trained for two days on a single **NVIDIA RTX 4090**. You can use this checkpoint for inference or fine-tuning. The key training parameters for this checkpoint are as follows:

- `batch_size`: 48
- `mel_frames`: 800
- `lr`: 0.0001
- `num_epochs`: 100000
- `sample_interval`: 250
- `h5_file_path`: './dataset/gtzan_train.h5'
- `device`: 'cuda:4'
- `input_size`: [80, 800]
- `patch_size`: 8
- `in_channels`: 1
- `hidden_size`: 384
- `depth`: 12
- `num_heads`: 6
- `checkpoint_dir`: 'gtzan-ck'

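As a quick sanity check on the geometry: with `input_size` [80, 800] and `patch_size` 8, a standard DiT patchify step yields a 10 × 100 grid, i.e. 1000 tokens per sample:

```python
# Token count implied by the checkpoint geometry above (standard DiT
# patchify: non-overlapping 8x8 patches over an 80x800 mel spectrogram).
mel_bins, mel_frames, patch_size = 80, 800, 8
num_tokens = (mel_bins // patch_size) * (mel_frames // patch_size)
print(num_tokens)  # 1000
```
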
You can modify the model architecture and parameters in the `train.yaml` configuration file to compare your models against ours. We will continue to release more checkpoints and models in future updates.

## Inference

Once you have trained your own model, you can run inference with it using the following command:

```bash
python sample.py --checkpoint ./gtzan-ck/model_epoch_20000.pt \
    --h5_file ./dataset/gtzan_test.h5 \
    --output_gt_dir ./sample/gt \
    --output_gen_dir ./sample/gn \
    --segment_length 800 \
    --sample_rate 22050
```

You can also try running inference with our pre-trained model to familiarize yourself with the process. We have saved some inference results in the `sample` folder as a demo. However, due to the limited size of our model, the generated results are not of the highest quality; they are intended as simple examples to guide further evaluation.

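If you want to see what `sample.py` will read, a quick way to inspect the test HDF5 file (the exact dataset names depend on how the file was built):

```python
# Print every group and dataset path stored in the test-set HDF5 file.
import h5py

with h5py.File("./dataset/gtzan_test.h5", "r") as f:
    f.visit(print)
```
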
## Evaluation

For the evaluation phase, we highly recommend creating a new environment and using the evaluation library available at [Generated Music Evaluation](https://github.com/HarlandZZC/generated_music_evaluation). That repository provides detailed instructions on setting up the environment and using the evaluation tools. New features and functionality will be added to this library over time.

Once you have set up the environment following the instructions from the evaluation repository, you can run the following script to evaluate your generated music:

```bash
python eval.py \
    --ref_path ../sample/gt \
    --gen_path ../sample/gn \
    --id2text_csv_path ../gtzan-test.csv \
    --output_path ./output \
    --device_id 0 \
    --batch_size 32 \
    --original_sample_rate 24000 \
    --fad_sample_rate 16000 \
    --kl_sample_rate 16000 \
    --clap_sample_rate 48000 \
    --run_fad 1 \
    --run_kl 1 \
    --run_clap 1
```

This script evaluates the generated music against reference music, producing evaluation metrics such as CLAP, KL, and FAD scores.

## To-Do

The following features and improvements are planned for future updates:

- **EMA Model**: Implement Exponential Moving Average (EMA) of model weights to stabilize training and improve final generation quality.
- **Long-Term Music Fine-tuning**: Explore fine-tuning the model to generate longer pieces of music with more coherent structure.
- **VAE Integration**: Integrate a Variational Autoencoder (VAE) to improve latent-space representations and potentially enhance generation diversity.
- **T5-based Text Conditioning**: Add T5 to enhance text conditioning, improving the control and accuracy of the text-to-music generation process.