Add pipeline_tag and paper link

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +48 -20
README.md CHANGED
@@ -3,19 +3,29 @@ license: mit
  tags:
  - text-to-audio
  - controlnet
  ---

  <img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

  # EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

  🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.

- 🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio](https://huggingface.co/spaces/OpenSound/EzAudio)

- 🎮 EzAudio-ControlNet is available: [EzAudio-ControlNet](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

- We want to thank Hugging Face Space and Gradio for providing incredible demo platform.

  ## Installation

@@ -28,38 +38,56 @@ Install the dependencies:
  cd EzAudio
  pip install -r requirements.txt
  ```
- Download checkponts from: [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)

  ## Usage

  You can use the model with the following code:

  ```python
- from api.ezaudio import load_models, generate_audio
-
- # model and config paths
- config_name = 'ckpts/ezaudio-xl.yml'
- ckpt_path = 'ckpts/s3/ezaudio_s3_xl.pt'
- vae_path = 'ckpts/vae/1m.pt'
- # save_path = 'output/'
- device = 'cuda' if torch.cuda.is_available() else 'cpu'

  # load model
- (autoencoder, unet, tokenizer,
-  text_encoder, noise_scheduler, params) = load_models(config_name, ckpt_path,
-                                                       vae_path, device)

  prompt = "a dog barking in the distance"
- sr, audio = generate_audio(prompt, autoencoder, unet, tokenizer, text_encoder, noise_scheduler, params, device)

  ```

  ## Todo
  - [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
  - [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- - [x] Release inference code
- - [ ] Release checkpoints for stage1 and stage2
- - [ ] Release training pipeline and dataset

  ## Reference
 
@@ -75,4 +103,4 @@ If you find the code useful for your research, please consider citing:
  ```

  ## Acknowledgement
- Some code are borrowed from or inspired by: [U-Vit](https://github.com/baofff/U-ViT), [Pixel-Art](https://github.com/PixArt-alpha/PixArt-alpha), [Huyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).
 
@@ -3,19 +3,29 @@ license: mit
  tags:
  - text-to-audio
  - controlnet
+ pipeline_tag: text-to-audio
+ library_name: diffusers
  ---

  <img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

  # EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

+ [EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)
+
+ **Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: this https URL.
+
+ [![Official Page](https://img.shields.io/badge/Official%20Page-EzAudio-blue?logo=Github&style=flat-square)](https://haidog-yaqub.github.io/EzAudio-Page/)
+ [![arXiv](https://img.shields.io/badge/arXiv-2409.10819-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2409.10819)
+ [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/spaces/OpenSound/EzAudio)
+
  🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.

+ 🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)

+ 🎮 EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

+ <!-- We want to thank Hugging Face Space and Gradio for providing incredible demo platform. -->

  ## Installation

 
@@ -28,38 +38,56 @@ Install the dependencies:
  cd EzAudio
  pip install -r requirements.txt
  ```
+
+ Download checkpoints (Optional):
+ [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)
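If you would rather fetch the weights manually instead of relying on the automatic download, a minimal sketch using `huggingface_hub` (the `ckpts` target directory is just an example):

```python
# Hypothetical manual download of the EzAudio checkpoints via huggingface_hub.
# The local_dir is an example; the EzAudio API can also fetch checkpoints automatically.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="OpenSound/EzAudio", local_dir="ckpts")
print(f"Checkpoints saved under: {ckpt_dir}")
```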

  ## Usage

  You can use the model with the following code:

  ```python
+ from api.ezaudio import EzAudio
+ import torch
+ import soundfile as sf

  # load model
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ ezaudio = EzAudio(model_name='s3_xl', device=device)

+ # text-to-audio generation
  prompt = "a dog barking in the distance"
+ sr, audio = ezaudio.generate_audio(prompt)
+ sf.write(f'{prompt}.wav', audio, sr)
+
+ # audio inpainting
+ prompt = "A train passes by, blowing its horns"
+ original_audio = 'ref.wav'
+ sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
+                                   mask_start=1, mask_length=5)
+ sf.write(f'{prompt}_edit.wav', audio, sr)
+ ```
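Both calls return `(sr, audio)` — the sample rate first, then the waveform — which is why the result can be passed straight to `soundfile`. A small sketch reusing only the calls shown above to render several prompts (the prompt list is made up):

```python
# Hypothetical batch generation loop; setup mirrors the snippet above.
import torch
import soundfile as sf
from api.ezaudio import EzAudio

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

prompts = ["rain falling on a tin roof", "footsteps on a gravel path"]  # example prompts
for p in prompts:
    sr, audio = ezaudio.generate_audio(p)   # returns sample rate and waveform
    sf.write(f"{p}.wav", audio, sr)          # write one file per prompt
```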
+
+ ## Training
+
+ #### Autoencoder
+ Refer to the VAE training section in our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio)
+
+ #### T2A Diffusion Model
+ Prepare your data (see example in `src/dataset/meta_example.csv`), then run:

+ ```bash
+ cd src
+ accelerate launch train.py
  ```
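The launch command above relies on 🤗 Accelerate's default configuration; to set mixed precision or run on multiple GPUs, something along these lines should work (the process count is only a placeholder):

```bash
# One-time interactive setup of Accelerate (mixed precision, GPU count, etc.)
accelerate config

# Example multi-GPU launch; 4 processes is a placeholder for your GPU count
accelerate launch --multi_gpu --num_processes 4 train.py
```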

  ## Todo
  - [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
  - [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
+ - [x] Release inference code
+ - [x] Release training pipeline and dataset
+ - [x] Improve API and support automatic ckpts downloading
+ - [ ] Release checkpoints for stage1 and stage2 [WIP]

  ## Reference

 
@@ -75,4 +103,4 @@ If you find the code useful for your research, please consider citing:
  ```

  ## Acknowledgement
+ Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).