Add pipeline_tag and paper link
#2
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -3,19 +3,29 @@ license: mit

tags:
- text-to-audio
- controlnet
pipeline_tag: text-to-audio
library_name: diffusers
---

<img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

[EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)

**Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: this https URL.
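
The CFG rescaling mentioned in point (2) is, in its commonly used form, a post-hoc correction that rescales the guided prediction so its statistics match the conditional branch, then blends it back with the plain CFG prediction. The sketch below is a generic illustration of that recipe, not EzAudio's actual implementation; the `guidance_scale` and `rescale_factor` values and the batched tensor shapes are placeholders.

```python
import torch

def rescaled_cfg(noise_cond: torch.Tensor,
                 noise_uncond: torch.Tensor,
                 guidance_scale: float = 5.0,
                 rescale_factor: float = 0.7) -> torch.Tensor:
    """Generic CFG-rescaling sketch (assumes batched tensors [B, ...])."""
    # standard classifier-free guidance
    noise_cfg = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    # rescale the guided prediction to match the conditional branch's std
    reduce_dims = tuple(range(1, noise_cond.ndim))
    std_cond = noise_cond.std(dim=reduce_dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=reduce_dims, keepdim=True)
    noise_rescaled = noise_cfg * (std_cond / std_cfg)
    # interpolate between the rescaled and the original CFG predictions
    return rescale_factor * noise_rescaled + (1.0 - rescale_factor) * noise_cfg
```

With `rescale_factor = 0` this reduces to standard classifier-free guidance.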

[Project Page](https://haidog-yaqub.github.io/EzAudio-Page/) | [arXiv](https://arxiv.org/abs/2409.10819) | [Hugging Face Space](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.

Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

<!-- We want to thank Hugging Face Spaces and Gradio for providing an incredible demo platform. -->

## Installation

@@ -28,38 +38,56 @@ Install the dependencies:

```bash
cd EzAudio
pip install -r requirements.txt
```

Download checkpoints (optional):
[https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)
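
The checkpoints can also be fetched programmatically. A minimal sketch using `huggingface_hub` is shown below; the earlier usage example in this README pointed at local paths such as `ckpts/ezaudio-xl.yml`, `ckpts/s3/ezaudio_s3_xl.pt`, and `ckpts/vae/1m.pt`, so a local `ckpts/` directory is used here as an assumed target, not a layout required by the current API.

```python
# Sketch: mirror the released EzAudio checkpoints into a local folder.
# Assumes `huggingface_hub` is installed; local_dir="ckpts" is an assumed layout.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="OpenSound/EzAudio", local_dir="ckpts")
print(f"Checkpoints available under: {ckpt_dir}")
```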

## Usage

You can use the model with the following code:

```python
import torch
import soundfile as sf
from api.ezaudio import EzAudio

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```
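
Because loading the model dominates start-up time, the same `EzAudio` instance can be reused across prompts. The sketch below builds only on the `generate_audio` and `sf.write` calls shown above; the prompt list and the output file naming are placeholders.

```python
import torch
import soundfile as sf
from api.ezaudio import EzAudio

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# Placeholder prompts; the single loaded model is reused for all of them.
prompts = [
    "rain falling on a tin roof",
    "a crowd applauding in a large hall",
    "footsteps on gravel",
]
for i, text in enumerate(prompts):
    sr, audio = ezaudio.generate_audio(text)
    sf.write(f"sample_{i:02d}.wav", audio, sr)  # assumed output naming
```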

## Training

#### Autoencoder
Refer to the VAE training section in our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio).

#### T2A Diffusion Model
Prepare your data (see example in `src/dataset/meta_example.csv`), then run:

```bash
cd src
accelerate launch train.py
```
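
Before preparing your own metadata file, it can help to inspect the bundled example so your columns match what the training script expects. The sketch below only reads and prints `src/dataset/meta_example.csv` (run from the repository root); it does not assume any particular column schema.

```python
# Sketch: inspect the bundled metadata example before writing your own file.
# Only the path src/dataset/meta_example.csv is taken from this README;
# no column schema is assumed.
import pandas as pd

meta = pd.read_csv("src/dataset/meta_example.csv")
print(meta.columns.tolist())  # fields used by the training pipeline
print(meta.head())
```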

## Todo
- [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve API and support automatic ckpts downloading
- [ ] Release checkpoints for stage1 and stage2 [WIP]

## Reference

@@ -75,4 +103,4 @@ If you find the code useful for your research, please consider citing:

## Acknowledgement
Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).