anonymitaet committed
Commit 71209f4 • 1 Parent(s): 8581fc1

[doc][feat] add readme

README.md (CHANGED)
@@ -2,4 +2,184 @@
license: other
license_name: yi-license
license_link: LICENSE
---
<div align="center">

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_dark.svg" width="200px">
  <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_light.svg" width="200px">
  <img alt="Yi logo" src="https://raw.githubusercontent.com/01-ai/Yi/main/assets/img/Yi_logo_icon_light.svg" width="200px">
</picture>

</div>

<div align="center">
<h3 align="center">Yi-VL: Better Bilingual Multimodal Models</h3>
</div>

<p align="center">
🤗 <a href="https://huggingface.co/01-ai" target="_blank">Hugging Face</a> • 🤖 <a href="https://www.modelscope.cn/organization/01ai/" target="_blank">ModelScope</a> • ✡️ <a href="https://wisemodel.cn/organization/01.AI" target="_blank">WiseModel</a>
</p>

<p align="center">
👩‍🚀 Ask questions or discuss ideas on <a href="https://github.com/01-ai/Yi/discussions" target="_blank">GitHub</a>!
</p>

<p align="center">
👋 Join us on 💬 <a href="https://github.com/01-ai/Yi/issues/43#issuecomment-1827285245" target="_blank">WeChat (Chinese)</a>!
</p>

<p align="center">
📚 Grow at the <a href="https://github.com/01-ai/Yi/blob/main/docs/learning_hub.md">Yi Learning Hub</a>!
</p>

<hr>

<!-- DO NOT REMOVE ME -->

<details open>
<summary><b>📕 Table of Contents</b></summary>

- [What is Yi-VL?](#what-is-yi-vl)
  - [Overview](#overview)
  - [Models](#models)
  - [Features](#features)
  - [Architecture](#architecture)
  - [Training](#training)
  - [Limitations](#limitations)
  - [Citation](#citation)
- [Why Yi-VL?](#why-yi-vl)
  - [Benchmarks](#benchmarks)
- [How to use Yi-VL?](#how-to-use-yi-vl)
  - [Quick Start](#quick-start)

</details>

<hr>

# What is Yi-VL?

## Overview

- The **Yi Visual Language (Yi-VL)** model is the open-source, multimodal version of the Yi **Large Language Model (LLM)** series, enabling content comprehension, recognition, and multi-round conversations about images.

- Yi-VL demonstrates exceptional performance, **ranking first** among all existing open-source models on the latest benchmarks, including [MMMU](https://mmmu-benchmark.github.io/#leaderboard) in English and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard) in Chinese (based on data available up to January 2024).

- Yi-VL-34B is the **first** open-source 34B vision language model worldwide.

<div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

## Models

Yi-VL is available in the following versions.

Model | Download
---|---
Yi-VL-6B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-6B)
Yi-VL-34B | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-VL-34B)
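
If you want to fetch the weights programmatically, the `huggingface_hub` client is one option. The snippet below is a minimal sketch rather than part of the official instructions; it assumes `huggingface_hub` is installed, and the `local_dir` value is just an example.

```python
# Minimal sketch: download Yi-VL-6B from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; the local_dir path is an arbitrary example.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="01-ai/Yi-VL-6B",   # swap for "01-ai/Yi-VL-34B" if desired
    local_dir="./Yi-VL-6B",     # where to place the checkpoint files
)
print(f"Model files downloaded to: {local_path}")
```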

<div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

## Features

Yi-VL offers the following features:

- Multi-round text-image conversations: Yi-VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.

- Bilingual text support: Yi-VL supports conversations in both English and Chinese, including text recognition in images.

- Strong image comprehension: Yi-VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.

- Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448×448.

<div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

## Architecture

Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of the following components:

- Vision Transformer (ViT): initialized with the [CLIP ViT-H/14 model](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and used for image encoding.

- Projection Module: bridges the ViT and the LLM using a two-layer MLP with layer normalization (see the sketch after this list).

- Large Language Model (LLM): initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat).

![Yi-VL architecture]()
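
To make the projection module concrete, here is a minimal PyTorch sketch of a two-layer MLP with layer normalization that maps ViT features into the LLM embedding space. The class name, hidden sizes, and activation are illustrative assumptions, not the exact Yi-VL implementation.

```python
# Illustrative sketch only -- not the official Yi-VL code.
# Maps ViT features (vision_dim) into the LLM embedding space (llm_dim)
# with a two-layer MLP plus layer normalization.
import torch
import torch.nn as nn

class ProjectionModule(nn.Module):
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the ViT
        return self.proj(image_features)

# Example: project a dummy batch of 256 patch embeddings.
dummy = torch.randn(1, 256, 1280)
print(ProjectionModule()(dummy).shape)  # torch.Size([1, 256, 4096])
```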

<div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

## Training

Yi-VL is trained to align visual information with the semantic space of the Yi LLM through a three-stage training process (a parameter-freezing sketch follows this list):

- Stage 1: The parameters of the ViT and the projection module are trained at an image resolution of 224×224. The LLM weights are frozen.

- Stage 2: The image resolution of the ViT is scaled up to 448×448, and the parameters of the ViT and the projection module are trained.

- Stage 3: The parameters of the entire model (that is, the ViT, the projection module, and the LLM) are trained.
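
In practice, this staged schedule comes down to choosing which parameter groups receive gradients at each stage. The sketch below illustrates that idea with hypothetical `vit`, `projector`, and `llm` submodules; it is not the actual Yi-VL training code.

```python
# Illustrative sketch of staged parameter freezing -- not the official training script.
# Assumes a wrapper model with hypothetical .vit, .projector, and .llm submodules.
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Enable gradients only for the parts trained in the given stage."""
    trainable = {
        1: ["vit", "projector"],          # 224x224 images; ViT + projection trained
        2: ["vit", "projector"],          # 448x448 images; ViT + projection trained
        3: ["vit", "projector", "llm"],   # the entire model is trained
    }[stage]
    for name, module in model.named_children():
        requires_grad = name in trainable
        for param in module.parameters():
            param.requires_grad = requires_grad

# Example usage with a hypothetical wrapper module:
# model = YiVLModel(vit=..., projector=..., llm=...)
# configure_stage(model, stage=1)
```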

<div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

## Limitations

This is the initial release of Yi-VL, and it comes with some known limitations. It is recommended to carefully evaluate potential risks before adopting the models.

- Feature limitations

  - Visual question answering is supported. Other features like text-to-3D and image-to-video are not yet supported.

  - Only a single image, rather than multiple images, can be accepted as input.

- Hallucination problem

  - There is a certain possibility of generating content that does not exist in the image.

  - In scenes containing multiple objects, some objects might be incorrectly identified or described with insufficient detail.

- Resolution issue

  - Yi-VL is trained on images with a resolution of 448×448. During inference, inputs of any resolution are resized to 448×448. Low-resolution images may result in information loss, and images with resolution above 448×448 do not provide additional detail to the model.

- Other limitations of the Yi LLM also apply.

## Citation

If you find our work helpful, please feel free to cite us.

```
@article{tbd,
year={2024}
}
```

<div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

# Why Yi-VL?

## Benchmarks

Yi-VL outperforms all existing open-source models on [MMMU](https://mmmu-benchmark.github.io/#leaderboard) and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard), two advanced benchmarks that include massive multi-discipline multimodal questions.

![Yi-VL benchmark]()

<div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

# How to use Yi-VL?

## Quick Start

You can perform inference using the code from [LLaVA](https://github.com/haotian-liu/LLaVA). For detailed steps, see [simple startup for pretraining](https://github.com/haotian-liu/LLaVA/pull/966).

Notes:

- You need to modify the system prompt as follows.

```bash
This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。

### Human: <image_placeholder>
What is it in the image?
### Assistant:
```

- You need to set the `mm_vision_tower` parameter in `config.json` to the local path of the ViT weights (see the sketch below).
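
For example, the following sketch edits `config.json` to point `mm_vision_tower` at a locally downloaded ViT. The file location and the `/path/to/...` value are placeholders you would replace with your own paths; this is an illustration, not an official setup script.

```python
# Minimal sketch: point mm_vision_tower at a locally downloaded ViT.
# The paths below are placeholders, not values shipped with the model.
import json
from pathlib import Path

config_path = Path("./Yi-VL-6B/config.json")   # model checkpoint directory
vit_path = "/path/to/local/clip-vit-h-14"      # local ViT weights

config = json.loads(config_path.read_text())
config["mm_vision_tower"] = vit_path
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
print(f"mm_vision_tower set to {vit_path}")
```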