nielsr (HF staff) committed
Commit b036489
Parent: 89ee1eb

Add pipeline tag


This PR ensures the model can be found at https://huggingface.co/models?pipeline_tag=any-to-any&sort=trending
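
For reference, the Hub filter behind the URL above can also be queried programmatically. A minimal sketch with `huggingface_hub` (assuming a reasonably recent client version in which `list_models` accepts a `pipeline_tag` argument):

```python
# Sketch: list models carrying the any-to-any pipeline tag,
# i.e. the same set surfaced by the Hub page linked above.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(pipeline_tag="any-to-any", limit=10):
    print(model.id)
```

Once this PR is merged, the model should show up in that listing.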

Files changed (1)
  1. README.md +24 -23
README.md CHANGED
@@ -1,24 +1,25 @@
- ---
- license: mit
- ---
- # VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
-
- ## Abstract
- VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and the finding that autoregressive image generation can achieve quality similar to diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
-
- ## Useful links
-
- - Paper: https://arxiv.org/abs/2409.04429
- - GitHub: https://github.com/mit-han-lab/vila-u
- - Project: https://hanlab.mit.edu/projects/vila-u
-
- ## Citation
-
- ```bibtex
- @article{wu2024vila,
- title={Vila-u: a unified foundation model integrating visual understanding and generation},
- author={Wu, Yecheng and Zhang, Zhuoyang and Chen, Junyu and Tang, Haotian and Li, Dacheng and Fang, Yunhao and Zhu, Ligeng and Xie, Enze and Yin, Hongxu and Yi, Li and others},
- journal={arXiv preprint arXiv:2409.04429},
- year={2024}
- }
+ ---
+ license: mit
+ pipeline_tag: any-to-any
+ ---
+ # VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
+
+ ## Abstract
+ VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and the finding that autoregressive image generation can achieve quality similar to diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
+
+ ## Useful links
+
+ - Paper: https://arxiv.org/abs/2409.04429
+ - GitHub: https://github.com/mit-han-lab/vila-u
+ - Project: https://hanlab.mit.edu/projects/vila-u
+
+ ## Citation
+
+ ```bibtex
+ @article{wu2024vila,
+ title={Vila-u: a unified foundation model integrating visual understanding and generation},
+ author={Wu, Yecheng and Zhang, Zhuoyang and Chen, Junyu and Tang, Haotian and Li, Dacheng and Fang, Yunhao and Zhu, Ligeng and Xie, Enze and Yin, Hongxu and Yi, Li and others},
+ journal={arXiv preprint arXiv:2409.04429},
+ year={2024}
+ }
  ```
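
The only functional change in this diff is the new `pipeline_tag: any-to-any` entry in the README's YAML front matter. A hedged way to confirm the tag is picked up after the merge, again via `huggingface_hub` (the repo id below is a hypothetical placeholder, since the target repository is not named in this diff):

```python
# Sketch: read back the repo metadata and check its pipeline tag.
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("mit-han-lab/vila-u-7b-256")  # hypothetical repo id, not taken from this PR
print(info.pipeline_tag)  # expected: "any-to-any" once the front matter change is live
```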