czczup committed
Commit f564018 · verified · 1 Parent(s): da92483

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -25,7 +25,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
  - **Model Type:** multimodal chatbot
  - **Model Stats:**
  - Architecture: InternViT-6B + MLP + LLaMA2-13B
- - Params (M): 19B
+ - Params: 19B
  - Image size: 448 x 448
  - Number of visual tokens: 256
 
@@ -33,7 +33,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
  - Pretraining Stage
  - Learnable Component: InternViT-6B
  - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
- - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448x448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
+ - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
  - SFT Stage
  - Learnable Component: MLP + LLM
  - Data: A comprehensive collection of open-source SFT datasets, along with their Chinese translation versions, totaling approximately 10M.
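
For readers checking the token counts in the Note line of this diff, here is a minimal sketch of the arithmetic. It is not the InternVL implementation. The patch size of 14 is an assumption (the README does not state it), chosen because it yields the 1024 tokens the Note mentions for a 448 x 448 input. The helper names `token_grid_side` and `merge_token_blocks` are hypothetical, and the block merge shown is one plausible reading of what the README calls "pixel shuffle": each 2 x 2 group of neighboring tokens is concatenated along the channel axis, so the token count drops by 4x while the channel width grows by 4x. The real model may order the merged channels differently.

```python
def token_grid_side(image_size: int, patch_size: int) -> int:
    """Return the number of patches per side for a square image."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def merge_token_blocks(grid, factor=2):
    """Concatenate each factor x factor block of tokens into one wider token.

    grid is an H x W nested list whose entries are lists of C channel values.
    The result is an (H/factor) x (W/factor) grid with C * factor**2 channels.
    """
    h, w = len(grid), len(grid[0])
    if h % factor or w % factor:
        raise ValueError("grid sides must be divisible by factor")
    merged = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            token = []
            for di in range(factor):
                for dj in range(factor):
                    token.extend(grid[i + di][j + dj])
            row.append(token)
        merged.append(row)
    return merged


PATCH_SIZE = 14  # assumption: not stated in the README

side = token_grid_side(448, PATCH_SIZE)
# Toy grid with one channel per token; each value is the token's flat index.
grid = [[[float(r * side + c)] for c in range(side)] for r in range(side)]
merged = merge_token_blocks(grid, factor=2)

tokens_before = side * side
tokens_after = len(merged) * len(merged[0])
print(tokens_before, tokens_after, len(merged[0][0]))  # 1024 256 4
```

Under the same patch-size assumption, a 224 x 224 input gives a 16 x 16 grid, which is why the pretrained position embedding has to be interpolated to 32 x 32 before 448 x 448 images can be used.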