Update README.md
README.md (changed)

pipeline_tag: image-feature-extraction
---

# Model Card for InternViT-6B-448px-V1-0

<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/AUE-3OBtfr9vDA7Elgkhd.webp" alt="Image Description" width="300" height="300">

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]

| Model                   | Date       | Download                                                                | Note                             |
| ----------------------- | ---------- | ----------------------------------------------------------------------- | -------------------------------- |
| InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)  | support dynamic resolution, super strong OCR (🔥new) |
| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)  | 448 resolution                   |
| InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)  | 448 resolution                   |
| InternViT-6B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)       | vision foundation model          |
| InternVL-14B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)       | vision-language foundation model |

## Model Details
- **Model Type:** vision foundation model, feature backbone
- **Model Stats:**
  - Params (M): 5903
  - Image size: 448 x 448
- **Pretrain Dataset:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi, and OCR-related datasets.
- **Note:** This model has 48 blocks, and we found that using the output after the fourth-to-last block worked best for VLLMs. Therefore, **please set `mm_vision_select_layer=-4` when using this model to build a VLLM** (a minimal sketch of this layer selection follows below).
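
To illustrate that note, here is a small helper (ours, not code from the released repository) that picks the fourth-to-last block's output, assuming the remote InternViT code accepts `output_hidden_states=True` and returns a standard `transformers`-style `hidden_states` tuple; the function name is hypothetical, and `vision_model` / `pixel_values` can be built exactly as in the usage example below.

```python
import torch

def select_vision_features(vision_model, pixel_values, select_layer: int = -4):
    """Illustrative sketch: return the block output that mm_vision_select_layer=-4
    would pick in LLaVA-style VLLM code. Assumes hidden_states[0] is the
    patch-embedding output and hidden_states[-1] is the last block's output."""
    with torch.no_grad():
        outputs = vision_model(pixel_values, output_hidden_states=True)
    return outputs.hidden_states[select_layer]  # output after the 4th-to-last block
```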
## Model Usage (Image Embeddings)

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V1-0',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-0')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
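
The snippet above stops after preparing `pixel_values`. A possible continuation (our sketch; the output field names assume the remote code returns a standard `transformers` pooled vision-model output) extracts the image embeddings:

```python
# Sketch of a forward pass continuing the snippet above (field names are an
# assumption about the remote model's output object).
with torch.no_grad():
    outputs = model(pixel_values)

patch_features = outputs.last_hidden_state  # per-patch token features
image_embedding = outputs.pooler_output     # single pooled feature per image
print(patch_features.shape, image_embedding.shape)
```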