Commit a16e917 (verified), committed by czczup
Parent(s): 2c6e7bf

Update README.md

Files changed (1): README.md (+13 -11)

README.md CHANGED
@@ -10,24 +10,26 @@ datasets:
 pipeline_tag: image-feature-extraction
 ---
 
-# Model Card for InternViT-6B-448px
+# Model Card for InternViT-6B-448px-V1-0
 
-## What is InternVL?
+<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/AUE-3OBtfr9vDA7Elgkhd.webp" alt="Image Description" width="300" height="300">
 
-\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]
+\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
 
-InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
-
-It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ performances on a wide range of tasks such as visual perception, cross-modal retrieval, multimodal dialogue, etc.
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k5UATwX5W2b5KJBN5C58x.png)
+| Model                   | Date       | Download                                                               | Note                                                 |
+| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
+| InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | support dynamic resolution, super strong OCR (🔥new) |
+| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution                   |
+| InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) | 448 resolution                   |
+| InternViT-6B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)      | vision foundation model          |
+| InternVL-14B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)      | vision-language foundation model |
 
 ## Model Details
 - **Model Type:** vision foundation model, feature backbone
 - **Model Stats:**
   - Params (M): 5903
   - Image size: 448 x 448
-- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi, OCR data
+- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi, OCR-related datasets.
 - **Note:** This model has 48 blocks, and we found that using the output after the fourth-to-last block worked best for VLLM. Therefore, **please set mm_vision_select_layer=-4 when using this model to build VLLM.**
 
 ## Model Usage (Image Embeddings)
@@ -38,14 +40,14 @@ from PIL import Image
 from transformers import AutoModel, CLIPImageProcessor
 
 model = AutoModel.from_pretrained(
-    'OpenGVLab/InternViT-6B-448px',
+    'OpenGVLab/InternViT-6B-448px-V1-0',
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True).cuda().eval()
 
 image = Image.open('./examples/image1.jpg').convert('RGB')
 
-image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px')
+image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-0')
 
 pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
 pixel_values = pixel_values.to(torch.bfloat16).cuda()
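
The usage hunk ends after moving `pixel_values` onto the GPU. Below is a minimal, hedged sketch (not part of this commit) of how the updated example can be continued to pull features out of the renamed V1-0 checkpoint, including the fourth-to-last block that the card's note recommends via `mm_vision_select_layer=-4`. It assumes the checkpoint's remote code follows the usual `transformers` convention of returning `pooler_output` and per-block `hidden_states` when called with `output_hidden_states=True`; the variable names are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Same setup as the README example (the V1-0 repo name introduced in this commit).
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-0',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-0')

image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    # Assumption: the remote code exposes per-block hidden states via the
    # standard `output_hidden_states=True` flag, like other transformers ViTs.
    outputs = model(pixel_values, output_hidden_states=True)

# Plain image embedding: pooled output of the final block (assumed field name).
image_embeds = outputs.pooler_output        # (1, hidden_size)

# Feature map suggested for building a VLLM: output after the fourth-to-last
# block, i.e. mm_vision_select_layer=-4 in LLaVA-style configs.
vit_features = outputs.hidden_states[-4]    # (1, num_tokens, hidden_size)

print(image_embeds.shape, vit_features.shape)
```

LLaVA-style stacks typically implement `mm_vision_select_layer=-4` by indexing `hidden_states[-4]` exactly as above, which is how the `-4` in the model card's note is usually interpreted.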