ZheqiDAI committed on
Commit ea731fd
1 Parent(s): 070e26e

Initial commit

Files changed (2)
  1. README copy.md +0 -141
  2. README.md +141 -3
README copy.md DELETED
@@ -1,141 +0,0 @@
(The deleted file duplicated the 141 lines of README content that are added to README.md below.)
README.md CHANGED
@@ -1,3 +1,141 @@
- ---
- license: mit
- ---
# Musimple: Text2Music with DiT Made Simple

## Introduction

This repository provides a simple and clear implementation of a **Text-to-Music Generation** pipeline using a **DiT (Diffusion Transformer)** model. The codebase includes key components such as **model training**, **inference**, and **evaluation**. We use the **GTZAN dataset** as an example to demonstrate a minimal, working pipeline for text-conditioned music generation.

The repository is designed to be easy to use and customize, making it simple to reproduce our results on a single **NVIDIA RTX 4090 GPU**. The code is also structured to be flexible, so you can adapt it to your own tasks and datasets.

We plan to keep maintaining and improving this repository with new features, model improvements, and extended documentation.

## Features

- **Text-to-Music Generation**: Generate music directly from text descriptions using a DiT model.
- **GTZAN Example**: A simple pipeline using the GTZAN dataset to demonstrate the workflow.
- **End-to-End Pipeline**: Includes model training, inference, and evaluation, with support for generating audio files.
- **Customizable**: Easy to modify and extend for different datasets or use cases.
- **Single-GPU Training**: Optimized for training on a single RTX 4090 but adaptable to other hardware setups.

## Requirements

Before using the code, ensure that the following dependencies are installed:

- Python >= 3.9
- CUDA (if available)
- Required Python libraries from `requirements.txt`

You can install the dependencies with:

```bash
conda create -n musimple python=3.9
conda activate musimple
pip install -r requirements.txt
```
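If you plan to train on a GPU, it is worth confirming that CUDA is visible before you start. A minimal check, assuming PyTorch is installed by `requirements.txt` (the checkpoints in this repository are PyTorch `.pt` files):

```python
import torch

# Confirm that a CUDA device is visible before starting training or inference.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```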
## Data Preprocessing

To begin, download the **GTZAN dataset**. Once downloaded, use the `gtzan_split.py` script in the `tools` directory to split the dataset into training and testing sets:

```bash
python gtzan_split.py --root_dir /path/to/gtzan/genres --output_dir /path/to/output/directory
```

Next, convert the audio files into HDF5 format using the `gtzan2h5.py` script:

```bash
python gtzan2h5.py --root_dir /path/to/audio/files --output_h5_file /path/to/output.h5 --config_path bigvgan_v2_22khz_80band_256x/config.json --sr 22050
```

### Preprocessed Data

If this process seems cumbersome, don't worry! **We have already preprocessed the dataset**, and you can find it in the **musimple/dataset** directory. You can download and use this data directly to skip the preprocessing steps. A quick way to inspect the resulting file is shown below.
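The exact dataset names inside the HDF5 file depend on how `gtzan2h5.py` lays it out, so the snippet below simply lists whatever it finds (a sketch using `h5py`):

```python
import h5py

# Walk the preprocessed GTZAN file and print every dataset it contains.
with h5py.File("./dataset/gtzan_train.h5", "r") as f:
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")

    f.visititems(describe)
```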
### Data Breakdown

This preprocessing stage has two main parts:

- **Text to Latent Transformation**: We use a Sentence Transformer to convert the text labels into latent representations.
- **Audio to Mel Spectrogram**: The original audio files are converted into mel spectrograms.

Both the latent representations and the mel spectrograms are stored in an HDF5 file, making them easily accessible during training and inference. A rough sketch of these two transformations follows.
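The sketch below illustrates the two steps; it is not the actual `gtzan2h5.py` code. The Sentence Transformer model name and the STFT settings are assumptions for illustration, chosen to match the 80-band, 22.05 kHz BigVGAN config referenced above:

```python
import librosa
import numpy as np
from sentence_transformers import SentenceTransformer

# 1) Text -> latent: encode a genre label with a Sentence Transformer.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # model name is an assumption
text_latent = text_encoder.encode("blues")               # 384-dim vector for this model

# 2) Audio -> mel spectrogram: 80 mel bands at 22.05 kHz
#    (hop/FFT sizes are illustrative, not taken from the repository).
audio, sr = librosa.load("blues.00000.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))

print(text_latent.shape, log_mel.shape)  # (384,), (80, n_frames)
```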
## Training

To start training, navigate to the `Musimple` directory and run:

```bash
cd Musimple
python train.py
```

### Configurable Parameters

All training-related parameters can be adjusted in the configuration file located at:

```
./config/train.yaml
```

This lets you easily modify settings such as the learning rate, batch size, and number of epochs to suit your hardware or dataset.

We also provide a **pre-trained checkpoint** trained for two days on a single **NVIDIA RTX 4090**. You can use this checkpoint for inference or fine-tuning. The key training parameters for this checkpoint are listed below (see the sketch after the list for how they fit together):

- `batch_size`: 48
- `mel_frames`: 800
- `lr`: 0.0001
- `num_epochs`: 100000
- `sample_interval`: 250
- `h5_file_path`: './dataset/gtzan_train.h5'
- `device`: 'cuda:4'
- `input_size`: [80, 800]
- `patch_size`: 8
- `in_channels`: 1
- `hidden_size`: 384
- `depth`: 12
- `num_heads`: 6
- `checkpoint_dir`: 'gtzan-ck'
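As a quick orientation, each training example is an 80 x 800 mel spectrogram split into 8 x 8 patches, i.e. (80 / 8) x (800 / 8) = 1000 tokens per sample. A minimal sketch of reading the config, assuming the keys above sit at the top level of `train.yaml`:

```python
import yaml

# Load the training configuration (key names follow the list above).
with open("./config/train.yaml") as f:
    cfg = yaml.safe_load(f)

mel_bins, mel_frames = cfg["input_size"]   # [80, 800]
patch = cfg["patch_size"]                  # 8
num_tokens = (mel_bins // patch) * (mel_frames // patch)
print(f"DiT sequence length: {num_tokens} patches")  # 10 * 100 = 1000
```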
You can modify the model architecture and parameters in the `train.yaml` configuration file to compare your models against ours. We will continue to release more checkpoints and models in future updates.

## Inference

Once you have trained a model, you can run inference with it as follows:

```bash
python sample.py --checkpoint ./gtzan-ck/model_epoch_20000.pt \
    --h5_file ./dataset/gtzan_test.h5 \
    --output_gt_dir ./sample/gt \
    --output_gen_dir ./sample/gn \
    --segment_length 800 \
    --sample_rate 22050
```

You can also run inference with our pre-trained model to familiarize yourself with the process. We have saved some inference results in the `sample` folder as a demo. However, due to the limited size of our model, the generated results are not of the highest quality; they are intended as simple examples to guide further evaluation.
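If sampling fails, it can help to first confirm that the checkpoint file loads at all. A minimal check, assuming a standard PyTorch checkpoint (the internal key names are not documented here):

```python
import torch

# Load the checkpoint on CPU and peek at its structure without building the model.
ckpt = torch.load("./gtzan-ck/model_epoch_20000.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print("Top-level keys:", list(ckpt.keys())[:10])
else:
    print("Loaded object of type:", type(ckpt))
```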
## Evaluation

For the evaluation phase, we highly recommend creating a new environment and using the evaluation library available at [Generated Music Evaluation](https://github.com/HarlandZZC/generated_music_evaluation). That repository provides detailed instructions on setting up the environment and using the evaluation tools, and new features will be added to it over time.

Once you have set up the environment following the instructions from the evaluation repository, run the following command to evaluate your generated music:

```bash
python eval.py \
    --ref_path ../sample/gt \
    --gen_path ../sample/gn \
    --id2text_csv_path ../gtzan-test.csv \
    --output_path ./output \
    --device_id 0 \
    --batch_size 32 \
    --original_sample_rate 24000 \
    --fad_sample_rate 16000 \
    --kl_sample_rate 16000 \
    --clap_sample_rate 48000 \
    --run_fad 1 \
    --run_kl 1 \
    --run_clap 1
```

This script compares the generated music against the reference music and reports evaluation metrics such as CLAP, KL, and FAD scores.
## To-Do

The following features and improvements are planned for future updates:

- **EMA Model**: Maintain an Exponential Moving Average (EMA) of the model weights to stabilize training and improve final generation quality (a sketch follows this list).
- **Long-Term Music Fine-Tuning**: Explore fine-tuning the model to generate longer pieces with more coherent structure.
- **VAE Integration**: Integrate a Variational Autoencoder (VAE) to improve the latent-space representation and potentially enhance generation diversity.
- **T5-based Text Conditioning**: Add a T5 text encoder to strengthen text conditioning, improving the control and accuracy of text-to-music generation.
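For reference, an EMA helper in PyTorch typically looks like the following. This is an illustrative sketch of the planned feature, not code from this repository:

```python
import copy
import torch

class EMA:
    """Keep an exponential moving average of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # frozen copy used for sampling
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage inside the training loop (sketch):
#   ema = EMA(model)
#   ... after each optimizer.step(): ema.update(model)
#   ... sample with ema.shadow instead of model
```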