Commit a16e917 (verified), committed by czczup
Parent(s): 2c6e7bf

Update README.md

Files changed (1): README.md (+13 -11)

README.md CHANGED
@@ -10,24 +10,26 @@ datasets:
 pipeline_tag: image-feature-extraction
 ---
 
-# Model Card for InternViT-6B-448px
+# Model Card for InternViT-6B-448px-V1-0
 
-## What is InternVL?
+<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/AUE-3OBtfr9vDA7Elgkhd.webp" alt="Image Description" width="300" height="300">
 
-\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]
+\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
 
-InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
-
-It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ performances on a wide range of tasks such as visual perception, cross-modal retrieval, multimodal dialogue, etc.
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k5UATwX5W2b5KJBN5C58x.png)
+| Model                   | Date       | Download                                                               | Note                                                 |
+| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
+| InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | support dynamic resolution, super strong OCR (🔥new) |
+| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution                   |
+| InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) | 448 resolution                   |
+| InternViT-6B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)      | vision foundation model          |
+| InternVL-14B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)      | vision-language foundation model |
 
 ## Model Details
 - **Model Type:** vision foundation model, feature backbone
 - **Model Stats:**
   - Params (M): 5903
   - Image size: 448 x 448
-- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi, OCR data
+- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi, OCR-related datasets.
 - **Note:** This model has 48 blocks, and we found that using the output after the fourth-to-last block worked best for VLLM. Therefore, **please set mm_vision_select_layer=-4 when using this model to build VLLM.**
 
 ## Model Usage (Image Embeddings)
@@ -38,14 +40,14 @@ from PIL import Image
 from transformers import AutoModel, CLIPImageProcessor
 
 model = AutoModel.from_pretrained(
-    'OpenGVLab/InternViT-6B-448px',
+    'OpenGVLab/InternViT-6B-448px-V1-0',
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True).cuda().eval()
 
 image = Image.open('./examples/image1.jpg').convert('RGB')
 
-image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px')
+image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-0')
 
 pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
 pixel_values = pixel_values.to(torch.bfloat16).cuda()
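
The usage hunk ends after moving `pixel_values` onto the GPU. Below is a minimal, hedged sketch (not part of this commit) of how the updated example can be continued to pull features out of the renamed V1-0 checkpoint, including the fourth-to-last block that the card's note recommends via `mm_vision_select_layer=-4`. It assumes the checkpoint's remote code follows the usual `transformers` convention of returning `pooler_output` and per-block `hidden_states` when called with `output_hidden_states=True`; the variable names are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Same setup as the README example (the V1-0 repo name introduced in this commit).
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-0',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-0')

image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    # Assumption: the remote code exposes per-block hidden states via the
    # standard `output_hidden_states=True` flag, like other transformers ViTs.
    outputs = model(pixel_values, output_hidden_states=True)

# Plain image embedding: pooled output of the final block (assumed field name).
image_embeds = outputs.pooler_output        # (1, hidden_size)

# Feature map suggested for building a VLLM: output after the fourth-to-last
# block, i.e. mm_vision_select_layer=-4 in LLaVA-style configs.
vit_features = outputs.hidden_states[-4]    # (1, num_tokens, hidden_size)

print(image_embeds.shape, vit_features.shape)
```

LLaVA-style stacks typically implement `mm_vision_select_layer=-4` by indexing `hidden_states[-4]` exactly as above, which is how the `-4` in the model card's note is usually interpreted.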