---
title: LeVo Song Generation
emoji: 🎵
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860
---

# SongGeneration

This repository is the official code repository for *LeVo: High-Quality Song Generation with Multi-Preference Alignment*. You can find our paper here, and the demo page is available here.

In this repository, we provide the SongGeneration model, inference scripts, and a checkpoint trained on the Million Song Dataset. Specifically, we have released the model and inference code corresponding to the SFT + auto-DPO version.

## Installation

### Start from scratch

You can install the necessary dependencies using the requirements.txt file with Python 3.8.12:

```sh
pip install -r requirements.txt
```

Then install flash-attention from a prebuilt wheel (the wheel below targets CPython 3.10, CUDA 12, and torch 2.2, so pick the wheel that matches your environment):

```sh
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl -P /home/
pip install /home/flash_attn-2.7.4.post1+cu12torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
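As a quick sanity check (just a suggestion; it assumes you will run inference from the same Python environment that `torch` and `flash_attn` were installed into), you can print the installed versions and CUDA availability:

```sh
# Print the torch version, CUDA availability, and the flash-attention version.
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available(), flash_attn.__version__)"
```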

### Start with Docker

```sh
docker pull juhayna/song-generation-levo:v0.1
docker run -it --gpus all --network=host juhayna/song-generation-levo:v0.1 /bin/bash
```
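If you download the checkpoints on the host (see the Inference section below), one option is to mount them into the container rather than copying them in. The `/app/SongGeneration` path below is only a guess at where the repository lives inside the image; adjust it to the actual location:

```sh
# Mount host-side ckpt and third_party folders into the container.
# /app/SongGeneration is a hypothetical path; replace it with the repo root inside the image.
docker run -it --gpus all --network=host \
  -v "$PWD/ckpt:/app/SongGeneration/ckpt" \
  -v "$PWD/third_party:/app/SongGeneration/third_party" \
  juhayna/song-generation-levo:v0.1 /bin/bash
```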

## Inference

Please note that both of the folders below must be downloaded completely for the model to load correctly; they are sourced from here:

- Save `ckpt` to the root directory
- Save `third_party` to the root directory

Then, to run inference, use the following command:

```sh
sh generate.sh sample/lyric.jsonl sample/generate
```

- Input keys in `sample/lyric.jsonl` (an illustrative entry is shown after this list):
  - `idx`: name of the generated song file
  - `descriptions`: text description; can be None, or can specify gender, timbre, genre, mood, instrument, and BPM
  - `prompt_audio_path`: reference audio path; can be None or the path to a 10-second song clip
  - `gt_lyric`: lyrics; must follow the format `[Structure] Text`, and the supported structures can be found in `conf/vocab.yaml`
- Outputs written to `sample/generate`:
  - `audio`: generated audio files
  - `jsonl`: output jsonls
  - `token`: tokens corresponding to the generated audio files
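As an illustration, the snippet below writes a minimal one-entry jsonl and runs inference on it. The lyrics, description, and structure labels (`[verse]`, `[chorus]`) are placeholders rather than values taken from this repository, and the exact way to encode an empty `descriptions` or `prompt_audio_path` (JSON null, the string "None", or omitting the key) should follow the provided `sample/lyric.jsonl`; check `conf/vocab.yaml` for the supported structure tags.

```sh
# Write a one-line jsonl; all field values here are illustrative placeholders.
cat > sample/my_lyric.jsonl << 'EOF'
{"idx": "demo_song", "descriptions": "female, pop, piano, happy, the bpm is 120", "prompt_audio_path": null, "gt_lyric": "[verse] a placeholder verse line [chorus] a placeholder chorus line"}
EOF

# Generate audio, jsonl, and token outputs under sample/generate.
sh generate.sh sample/my_lyric.jsonl sample/generate
```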

## Note

Since the model is trained on data longer than 1 minute, if the given lyrics are too short, the model will automatically fill in the lyrics to extend the duration.

## License

The code and weights in this repository are released under the MIT license, as found in the LICENSE file.