Update Readme
Update model details.
README.md
CHANGED
@@ -103,13 +103,13 @@ Yi-VL offers the following features:

 ## Architecture

-Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of
+Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:

 - Vision Transformer (ViT): it's initialized with [CLIP ViT-H/14 model](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and used for image encoding.

-- Projection Module: it
+- Projection Module: it's designed to align image features with the text feature space and consists of a two-layer Multilayer Perceptron (MLP) with layer normalization.

-- Large Language Model (LLM): it's initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat).
+- Large Language Model (LLM): it's initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat), both of which demonstrate exceptional proficiency in understanding and generating English and Chinese. To enhance the performance of Yi-VL models in bilingual multimodal understanding and generation, a rich dataset of bilingual image-text pairs is leveraged.

 ![Yi-VL architecture]()

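To make the updated description of the projection module concrete, here is a minimal PyTorch sketch of a two-layer MLP with layer normalization that maps ViT image features into the LLM's text feature space. The class name, the GELU activation, and the default dimensions are illustrative assumptions, not the actual Yi-VL implementation.

```python
# Minimal sketch (not the official code) of the projection module described above.
# Dimensions and the GELU activation are assumptions.
import torch
import torch.nn as nn

class ProjectionModule(nn.Module):  # hypothetical name
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) produced by the ViT
        return self.mlp(image_features)  # (batch, num_patches, llm_dim)
```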
@@ -117,13 +117,15 @@ Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, whi

 ## Training

-Yi-VL is trained to align visual information well to the semantic space of Yi LLM, which undergoes a three-stage training process:
+Yi-VL is trained to align visual information with the semantic space of the Yi LLM through a comprehensive three-stage training process:

-- Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224×224. The LLM weights are frozen.
+- Stage 1: The parameters of the ViT and the projection module are trained at an image resolution of 224×224, while the LLM weights are frozen. Training leverages an image caption dataset comprising 100 million image-text pairs. The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.

-- Stage 2: The image resolution of ViT is scaled up to 448×448, and the parameters of ViT and the projection module are trained.
+- Stage 2: The image resolution of the ViT is scaled up to 448×448, and the parameters of the ViT and the projection module are trained. This stage aims to further boost the model's ability to discern intricate visual details and uses a dataset of about 25 million image-text pairs.

-- Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained.
+- Stage 3: The parameters of the entire model (that is, the ViT, the projection module, and the LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. The training data spans a diverse range of sources, totalling approximately 1 million image-text pairs, including image captioning, VQA, and grounding data. To keep the data balanced, the contribution from any single source is capped at 50,000 pairs.
+
+In Stages 1 and 2, the global batch size, learning rate, gradient clipping, and number of epochs are set to 4096, 1e-4, 0.5, and 1, respectively. In Stage 3, these parameters are adjusted to 256, 2e-5, 1.0, and 2. Training runs on 128 NVIDIA A100 GPUs and takes approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.

 <div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>

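As a rough illustration of the staged schedule above, the sketch below toggles which components are trainable and switches hyperparameters per stage. The model attribute names (`vit`, `projection`, `llm`) and the helper itself are hypothetical; the trainable components and hyperparameter values come from the text, and the Stage 3 resolution is assumed to stay at 448×448.

```python
# Illustrative sketch of the three-stage schedule described above; attribute names
# and this helper are hypothetical, the numeric settings follow the README.
STAGES = {
    1: dict(resolution=224, trainable=("vit", "projection"), batch=4096, lr=1e-4, grad_clip=0.5, epochs=1),
    2: dict(resolution=448, trainable=("vit", "projection"), batch=4096, lr=1e-4, grad_clip=0.5, epochs=1),
    3: dict(resolution=448, trainable=("vit", "projection", "llm"), batch=256, lr=2e-5, grad_clip=1.0, epochs=2),
}

def configure_stage(model, stage: int) -> dict:
    """Freeze or unfreeze sub-modules for the given stage and return its settings."""
    cfg = STAGES[stage]
    for name in ("vit", "projection", "llm"):
        for p in getattr(model, name).parameters():
            p.requires_grad = name in cfg["trainable"]
    return cfg  # resolution, batch size, lr, grad clip, and epochs for the trainer
```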
@@ -181,8 +183,8 @@ Notes:

 - You need to modify the system prompt as follows.

-```
-This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's
+```
+This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。

 ### Human: <image_placeholder>
 What is it in the image?
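For context on how the modified system prompt might be assembled at inference time, the snippet below stitches it together with a single image question in the `### Human:` format shown above. The `### Assistant:` generation cue and the helper function are assumptions for illustration; check the repository's inference code for the exact template.

```python
# Hypothetical helper: combine the bilingual system prompt shown above with one
# image-grounded question. The "### Assistant:" cue is an assumption.
SYSTEM_PROMPT = (
    "This is a chat between an inquisitive human and an AI assistant. "
    "Assume the role of the AI assistant. Read all the images carefully, "
    "and respond to the human's questions with informative, helpful, "
    "detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。"
    "假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。"
)

def build_prompt(question: str) -> str:
    return (
        f"{SYSTEM_PROMPT}\n\n"
        "### Human: <image_placeholder>\n"
        f"{question}\n"
        "### Assistant:"
    )

print(build_prompt("What is it in the image?"))
```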