short_description: Audio Flamingo 3 demo for multi-turn multi-audio chat
---

<div align="center" style="display: flex; justify-content: center; align-items: center; text-align: center;">
  <a href="https://github.com/NVIDIA/audio-flamingo" style="margin-right: 20px; text-decoration: none; display: flex; align-items: center;">
    <img src="static/logo-no-bg.png" alt="Audio Flamingo 3 🔥🚀🔥" width="120">
  </a>
</div>
<div align="center" style="display: flex; justify-content: center; align-items: center; text-align: center;">
  <h2>
    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models
  </h2>
</div>

<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
  <a href=""><img src="https://img.shields.io/badge/arXiv-2503.03983-AD1C18" style="margin-right: 5px;"></a>
  <a href="https://research.nvidia.com/labs/adlr/AF3/"><img src="https://img.shields.io/badge/Demo page-228B22" style="margin-right: 5px;"></a>
  <a href="https://github.com/NVIDIA/audio-flamingo"><img src='https://img.shields.io/badge/Github-Audio Flamingo 3-9C276A' style="margin-right: 5px;"></a>
  <a href="https://github.com/NVIDIA/audio-flamingo/stargazers"><img src="https://img.shields.io/github/stars/NVIDIA/audio-flamingo.svg?style=social"></a>
</div>

<div align="center" style="display: flex; justify-content: center; margin-top: 10px; flex-wrap: wrap; gap: 5px;">
  <a href="https://huggingface.co/nvidia/audio-flamingo-3">
    <img src="https://img.shields.io/badge/🤗-Checkpoints-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/nvidia/audio-flamingo-3-chat">
    <img src="https://img.shields.io/badge/🤗-Checkpoints (Chat)-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/AudioSkills">
    <img src="https://img.shields.io/badge/🤗-Dataset: AudioSkills--XL-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/LongAudio">
    <img src="https://img.shields.io/badge/🤗-Dataset: LongAudio--XL-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/AF-Chat">
    <img src="https://img.shields.io/badge/🤗-Dataset: AF--Chat-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/AF-Think">
    <img src="https://img.shields.io/badge/🤗-Dataset: AF--Think-ED5A22.svg">
  </a>
</div>

<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
  <a href="https://huggingface.co/spaces/nvidia/audio_flamingo_3"><img src="https://img.shields.io/badge/🤗-Gradio Demo (7B)-5F9EA0.svg" style="margin-right: 5px;"></a>
</div>

## Overview

This repo contains the PyTorch implementation of [Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models](). Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:

- Unified audio representation learning across speech, sound, and music
- Flexible, on-demand chain-of-thought reasoning ("Thinking in Audio")
- Long-context audio comprehension (up to 10 minutes, including speech)
- Multi-turn, multi-audio conversational dialogue (AF3-Chat)
- Voice-to-voice interaction (AF3-Chat)

Extensive evaluations confirm AF3's effectiveness, setting new state-of-the-art results on over 20 public audio understanding and reasoning benchmarks.

## Main Results

Audio Flamingo 3 outperforms prior SOTA models, including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5, on a number of understanding and reasoning benchmarks.

<div align="center">
  <img class="img-full" src="static/af3_radial-1.png" width="300">
</div>

<div align="center">
  <img class="img-full" src="static/af3_sota.png" width="400">
</div>

## Audio Flamingo 3 Architecture

Audio Flamingo 3 combines the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). It accepts audio inputs up to 10 minutes long.

<div align="center">
  <img class="img-full" src="static/af3_main_diagram-1.png" width="800">
</div>

## Installation

```bash
./environment_setup.sh af3
```

## Code Structure

- The folder `audio_flamingo_3/` contains the main training and inference code of Audio Flamingo 3.
- The folder `audio_flamingo_3/scripts` contains the inference scripts of Audio Flamingo 3, in case you would like to use our pretrained checkpoints on Hugging Face.

Each folder is self-contained, and we expect no cross-dependencies between these folders. This repo does not contain the code for the Streaming-TTS pipeline, which will be released in the near future.

## Single Line Inference

To run inference with the stage 3 model directly, run the command below:

```bash
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav
```

To run inference with the stage 3.5 model, run the command below:

```bash
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --model-path /path/to/checkpoint/af3-7b/stage35 --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav --peft-mode
```
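
The two commands differ only in the extra `--model-path` and `--peft-mode` flags, which appear to load the stage 3.5 weights as a PEFT adapter on top of the base checkpoint. For scripted or batch runs, a small Python wrapper can assemble the same command line; the sketch below is a convenience under that assumption (`build_infer_cmd` is a hypothetical helper, not part of the released code, and the checkpoint paths are placeholders):

```python
import subprocess

def build_infer_cmd(model_base, text, media, stage35_path=None):
    """Assemble the infer_audio.py command line shown above.

    model_base / stage35_path are placeholders for local checkpoint paths.
    """
    cmd = [
        "python", "llava/cli/infer_audio.py",
        "--model-base", model_base,
        "--conv-mode", "auto",
        "--text", text,
        "--media", media,
    ]
    if stage35_path is not None:
        # Stage 3.5 weights ride on top of the base model via --peft-mode.
        cmd += ["--model-path", stage35_path, "--peft-mode"]
    return cmd

# Example (stage 3), run from the repo root:
# subprocess.run(build_infer_cmd("/path/to/checkpoint/af3-7b",
#                                "Please describe the audio in detail",
#                                "static/audio1.wav"), check=True)
```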

## References

The main training and inference code in each folder is adapted from [NVILA](https://github.com/NVlabs/VILA/tree/main) ([Apache license](incl_licenses/License_1.md)).

## License

- The code in this repo is released under the [MIT license](incl_licenses/MIT_license.md).
- The checkpoints are for non-commercial use only under the [NVIDIA OneWay Noncommercial License](incl_licenses/NVIDIA_OneWay_Noncommercial_License.docx). They are also subject to the [Qwen Research license](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE), the [Terms of Use](https://openai.com/policies/terms-of-use) for data generated by OpenAI, and the original licenses accompanying each training dataset.
- Notice: Audio Flamingo 3 is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.

## Citation

- Audio Flamingo 2
```bibtex
@article{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2503.03983},
  year={2025}
}
```

- Audio Flamingo
```bibtex
@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}
```