czczup committed
Commit 41413c4 · verified · 1 Parent(s): 49ca3d4

Update README.md

Files changed (1): README.md +3 -37
README.md CHANGED
@@ -9,33 +9,12 @@ datasets:
  - wanng/wukong100m
  ---

- # Model Card for InternVL-Chat-Chinese-V1.2
-
- ## What is InternVL?
+ # Model Card for InternVL-Chat-Chinese-V1.2-Plus

  \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]

- InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
-
- ## InternVL-Chat-V1.2 Blog
-
- > Date: 2024/02/12<br>
- > Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai
-
- We are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.
-
  <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">

- From the experimental results, **we've observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model ([InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)).**
-
- For better training reproducibility, we follow the minimalist design and data efficiency similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and only employ around 1 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days using 32 A100 GPUs. The code, data, and model will be made publicly available.
-
- ### Data Preparation
-
- Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, utilizing approximately 1.2M of visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
-
- For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
-
  ### Performance

  \* Proprietary Model
@@ -53,19 +32,6 @@ For more details about data preparation, please see [here](https://github.com/Op
  | InternVL-Chat-V1.2-Plus | 448x448 | 50.3 | 45.6 | 59.9 | 83.8 | 82.0 | 58.7 | 1624/551 | 98.1\* | 88.7 | 71.3\* | 76.4 | - | 66.9 |

  - MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
- - In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.
-
- ### Training (SFT)
-
- We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
-
- For more details about training, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#start-training).
-
- The hyperparameters used for finetuning are listed in the following table.
-
- | Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
- | ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
- | InternVL-Chat-V1.2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |


  ## Model Details
@@ -83,7 +49,7 @@ The hyperparameters used for finetuning are listed in the following table.
  - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
  - SFT Stage
  - Learnable Component: ViT + MLP + LLM
- - Data: A simplified, fully open-source dataset, containing approximately 1 million samples.
+ - Data: 12 million SFT samples.


  ## Model Usage
@@ -101,7 +67,7 @@ from PIL import Image
  from transformers import AutoModel, CLIPImageProcessor
  from transformers import AutoTokenizer

- path = "OpenGVLab/InternVL-Chat-Chinese-V1-2"
+ path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"
  # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
  model = AutoModel.from_pretrained(
  path,