BellaBei committed on
Commit ba27432
1 Parent(s): a8fa12e

Update Readme

Update model details.

Files changed (1)
  1. README.md +11 -9
README.md CHANGED
@@ -103,13 +103,13 @@ Yi-VL offers the following features:
 
 ## Architecture
 
- Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of the following components:
+ Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:
 
 - Vision Transformer (ViT): it's initialized with [CLIP ViT-H/14 model](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and used for image encoding.
 
- - Projection Module: it builds a bridge between the ViT and LLM using a 2-layer MLP with layer normalization.
+ - Projection Module: it's designed to align image features with the text feature space and consists of a two-layer Multilayer Perceptron (MLP) with layer normalization (see the illustrative sketch after this hunk).
 
- - Large Language Model (LLM): it's initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat).
+ - Large Language Model (LLM): it's initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat), both of which demonstrate strong proficiency in understanding and generating English and Chinese. To enhance performance in bilingual multimodal understanding and generation, Yi-VL is trained on a rich dataset of bilingual image-text pairs.
 
 ![Yi-VL architecture]()
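As a point of reference for the projection module described in the hunk above, the sketch below shows one way a two-layer MLP with layer normalization can map ViT patch features into an LLM's embedding space. It is an illustration only, not the code this commit documents: the class name, the GELU activation, and the 1280 to 4096 dimensions (CLIP ViT-H/14 features into an assumed Yi-6B hidden size) are assumptions.

```python
# Illustrative sketch only, not the official Yi-VL code. The class name,
# the GELU activation, and the default dimensions are assumptions.
import torch
import torch.nn as nn


class ProjectionMLP(nn.Module):
    """Two-layer MLP with layer normalization that maps ViT patch
    features into the LLM's embedding space."""

    def __init__(self, vit_dim: int = 1280, llm_dim: int = 4096):
        # vit_dim: CLIP ViT-H/14 feature width; llm_dim: assumed Yi-6B hidden size
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),                       # activation choice is an assumption
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vit_dim) from the ViT
        return self.net(patch_features)      # (batch, num_patches, llm_dim)


if __name__ == "__main__":
    # Dummy features for one 448x448 image: 1 CLS token + 32*32 patches (assumed layout).
    dummy = torch.randn(1, 1025, 1280)
    print(ProjectionMLP()(dummy).shape)      # torch.Size([1, 1025, 4096])
```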
 
@@ -117,13 +117,15 @@ Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, whi
 
 ## Training
 
- Yi-VL is trained to align visual information well to the semantic space of Yi LLM, which undergoes a three-stage training process:
+ Yi-VL is trained to align visual information with the semantic space of the Yi LLM through a comprehensive three-stage training process:
 
- - Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224×224. The LLM weights are frozen.
+ - Stage 1: The parameters of the ViT and the projection module are trained at an image resolution of 224&times;224, while the LLM weights are frozen. Training leverages an image caption dataset of 100 million image-text pairs. The primary objective is to enhance the ViT's knowledge acquisition within the specified architecture and to achieve better alignment between the ViT and the LLM.
 
- - Stage 2: The image resolution of ViT is scaled up to 448×448, and the parameters of ViT and the projection module are trained.
+ - Stage 2: The image resolution of the ViT is scaled up to 448&times;448, and the parameters of the ViT and the projection module are trained. This stage aims to further boost the model's ability to discern intricate visual details. The dataset used in this stage includes about 25 million image-text pairs.
 
- - Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained.
+ - Stage 3: The parameters of the entire model (that is, the ViT, the projection module, and the LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, enabling it to seamlessly integrate and interpret visual and linguistic inputs. The training data spans a diverse range of sources, totaling approximately 1 million image-text pairs, including image caption, VQA, and grounding data. To keep the data balanced, the contribution from any single source is capped at 50,000 pairs.
+
+ In Stages 1 and 2, the global batch size, learning rate, gradient clip, and number of epochs are set to 4096, 1e-4, 0.5, and 1, respectively. In Stage 3, these are adjusted to 256, 2e-5, 1.0, and 2 (see the configuration sketch after this hunk). Training runs on 128 NVIDIA A100 GPUs and takes approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.
 
 <div align="right"> [ <a href="#building-the-next-generation-of-bilingual-visual-language-models">Back to top ⬆️ </a> ] </div>
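The stage-wise schedule quoted above can be summarized compactly. The snippet below is a hypothetical summary structure, not an actual Yi-VL training configuration: the `StageConfig` dataclass and its field names are illustrative, and the 448&times;448 resolution listed for Stage 3 is an assumption carried over from Stage 2, since the hunk above does not restate it. The numeric values are the ones stated in the diff.

```python
# Hypothetical summary of the stage-wise schedule described above, not an
# actual Yi-VL config file. Field names and the StageConfig structure are
# assumptions; the numeric values are the ones stated in the README diff.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StageConfig:
    image_resolution: int        # input resolution fed to the ViT
    trainable: Tuple[str, ...]   # modules updated in this stage
    dataset_pairs: int           # approximate number of image-text pairs
    global_batch_size: int
    learning_rate: float
    gradient_clip: float
    epochs: int


YI_VL_STAGES = {
    # Stage 1: ViT + projection trained at 224x224 with the LLM frozen, ~100M caption pairs.
    1: StageConfig(224, ("vit", "projection"), 100_000_000, 4096, 1e-4, 0.5, 1),
    # Stage 2: resolution scaled up to 448x448, same trainable modules, ~25M pairs.
    2: StageConfig(448, ("vit", "projection"), 25_000_000, 4096, 1e-4, 0.5, 1),
    # Stage 3: the entire model is trained on ~1M pairs (capped at 50k per source);
    # the 448x448 resolution here is an assumption, since the README does not restate it.
    3: StageConfig(448, ("vit", "projection", "llm"), 1_000_000, 256, 2e-5, 1.0, 2),
}

if __name__ == "__main__":
    for stage, cfg in YI_VL_STAGES.items():
        print(f"Stage {stage}: {cfg}")
```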
 
@@ -181,8 +183,8 @@ Notes:
 
 - You need to modify the system prompt as follows.
 
- ```bash
- This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。
+ ```
+ This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。
 
 ### Human: <image_placeholder>
 What is it in the image?