Commit 8832f57 by soujanyaporia (1 parent: 8f0a25f)

Update README.md

Files changed (1): README.md (+3 -1)
@@ -9,6 +9,8 @@ language:

  **TANGO** is a latent diffusion model for text to audio generation. **TANGO** can generate realistic audios including human sounds, animal sounds, natural and artificial sounds and sound effects from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet based diffusion model for audio generation. We outperform current state-of-the-art models for audio generation across both objective and subjective metrics. We release our model, training, inference code and pre-trained checkpoints for the research community.

+ 📣 We are inviting collaborators and sponsors to train **TANGO** on larger datasets, like AudioSET
+
  ## Code

  Our code is released here: [https://github.com/declare-lab/tango](https://github.com/declare-lab/tango)
@@ -60,6 +62,6 @@ audios = tango.generate_for_batch(prompts, samples=2)
  ```
  This will generate two samples for each of the three text prompts.

- # Limitations
+ ## Limitations

  TANGO is not always able to finely control its generations over textual control prompts as it is trained only on the small AudioCaps dataset. For example, the generations from TANGO for prompts Chopping tomatoes on a wooden table and Chopping potatoes on a metal table are very similar. Chopping vegetables on a table also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings. In the future, we plan to improve TANGO by training it on larger datasets and enhancing its compositional and controllable generation ability.