Update README.md
README.md CHANGED
@@ -33,6 +33,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
 - Pretraining Stage
   - Learnable Component: InternViT-6B
   - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
+  - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448x448 pixels. Moreover, to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens (illustrated in the sketch below).
 - SFT Stage
   - Learnable Component: MLP + LLM
   - Data: A comprehensive collection of open-source SFT datasets, along with their Chinese translation versions, totaling approximately 10M samples.
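
The added note mentions two operations: interpolating the 224px position embedding to the 448px token grid, and a pixel shuffle that cuts the number of visual tokens from 1024 to 256. Below is a minimal PyTorch sketch of both, not the repository's actual code: it assumes a 14x14 patch size (so 224px gives a 16x16 grid and 448px a 32x32 grid), assumes a hidden size of 3200 purely for illustration, and ignores any class token.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=16, new_grid=32):
    # pos_embed: (1, old_grid*old_grid, C) patch position embeddings.
    # Resize the 16x16 grid learned at 224px to the 32x32 grid needed at 448px.
    _, n, c = pos_embed.shape
    grid = pos_embed.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)      # (1, C, 16, 16)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)                   # (1, C, 32, 32)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)          # (1, 1024, C)

def pixel_shuffle_tokens(x, scale=0.5):
    # x: (B, N, C) visual tokens on a square grid.
    # Folds each 2x2 neighborhood of tokens into the channel dimension,
    # so 32x32 = 1024 tokens become 16x16 = 256 tokens with 4x the channels.
    b, n, c = x.shape
    h = w = int(n ** 0.5)                                                        # 32
    x = x.reshape(b, h, w, c)
    x = x.reshape(b, h, int(w * scale), int(c / scale))                          # (B, 32, 16, 2C)
    x = x.permute(0, 2, 1, 3)                                                    # (B, 16, 32, 2C)
    x = x.reshape(b, int(h * scale), int(w * scale), int(c / (scale * scale)))   # (B, 16, 16, 4C)
    return x.reshape(b, -1, int(c / (scale * scale)))                            # (B, 256, 4C)

pos_224 = torch.randn(1, 16 * 16, 3200)        # hidden size 3200 assumed for illustration
pos_448 = interpolate_pos_embed(pos_224)       # (1, 1024, 3200)
tokens = torch.randn(2, 1024, 3200)            # 448x448 input -> 32x32 = 1024 patch tokens
print(pixel_shuffle_tokens(tokens).shape)      # torch.Size([2, 256, 12800])
```

With a downscale factor of 0.5 per spatial dimension, every 2x2 block of neighboring tokens is merged into one wider token, which is what brings 1024 tokens down to 256 before they are passed to the language model.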
@@ -61,6 +62,8 @@ If you find this project useful in your research, please consider cite:
 
 This project is released under the MIT license. Parts of this project contain code and models (e.g., LLaMA2) from other sources, which are subject to their respective licenses.
 
+Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
+
 ## Acknowledgement
 
 InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!