Update README.md
README.md
CHANGED
@@ -115,7 +115,7 @@ Limitations: Although we have made efforts to ensure the safety of the model dur
 
 ## Quick Start
 
-We provide an example code to run InternVL2-8B using `transformers`.
+We provide an example code to run `InternVL2-8B` using `transformers`.
 
 > Please use transformers>=4.37.2 to ensure the model works normally.
 
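The `transformers>=4.37.2` note in this hunk is easy to overlook until the remote code fails to load; below is a minimal, hedged check (not part of the README) that surfaces the problem early:

```python
# Minimal sketch (not part of the README): fail fast if the installed
# transformers is older than the 4.37.2 floor mentioned above.
from packaging import version
import transformers

if version.parse(transformers.__version__) < version.parse("4.37.2"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is installed; "
        "InternVL2-8B expects transformers>=4.37.2"
    )
```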
@@ -150,21 +150,6 @@ model = AutoModel.from_pretrained(
     trust_remote_code=True).eval()
 ```
 
-#### BNB 4-bit Quantization
-
-```python
-import torch
-from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2-8B"
-model = AutoModel.from_pretrained(
-    path,
-    torch_dtype=torch.bfloat16,
-    load_in_4bit=True,
-    low_cpu_mem_usage=True,
-    use_flash_attn=True,
-    trust_remote_code=True).eval()
-```
-
 #### Multiple GPUs
 
 The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
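The paragraph above refers to hand-building a `device_map` so that the first and last layers of the LLM land on the same GPU. A minimal sketch of that idea follows; the module paths (e.g. `language_model.model.layers`) and the layer count are assumptions about the checkpoint layout rather than code taken from this diff:

```python
# Sketch of a hand-built device_map for multi-GPU inference: decoder layers are
# spread across GPUs, while the vision tower, embeddings, final norm/head and the
# last decoder layer stay on GPU 0, so the LLM's first and last layers share a
# device. Module names are assumptions, not verified against this repository.
import math
import torch

def split_model(num_llm_layers: int) -> dict:
    world_size = max(torch.cuda.device_count(), 1)
    device_map = {}
    if world_size > 1:
        # GPU 0 also hosts the vision encoder, so count it as half a GPU.
        per_gpu = math.ceil(num_llm_layers / (world_size - 0.5))
        shares = [per_gpu] * world_size
        shares[0] = math.ceil(per_gpu * 0.5)
    else:
        shares = [num_llm_layers]
    layer = 0
    for gpu, n in enumerate(shares):
        for _ in range(n):
            if layer < num_llm_layers:
                device_map[f'language_model.model.layers.{layer}'] = gpu
                layer += 1
    for name in ('vision_model', 'mlp1',
                 'language_model.model.tok_embeddings',  # InternLM2-style name (assumed)
                 'language_model.model.embed_tokens',    # Llama-style name (assumed)
                 'language_model.model.norm',
                 'language_model.output',
                 'language_model.lm_head'):
        device_map[name] = 0
    # Pin the last decoder layer to GPU 0 next to the embeddings and output head.
    device_map[f'language_model.model.layers.{num_llm_layers - 1}'] = 0
    return device_map

# Hypothetical usage, mirroring the from_pretrained call shown in the hunk above:
# model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
#                                   low_cpu_mem_usage=True, use_flash_attn=True,
#                                   trust_remote_code=True,
#                                   device_map=split_model(32)).eval()
```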
@@ -423,7 +408,7 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
                                num_patches_list=num_patches_list, history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
-question = 'Describe this video in detail.
+question = 'Describe this video in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list, history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
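For context, `num_patches_list` in this hunk carries one entry per sampled video frame, and the first question is usually prefixed with a matching `<image>` placeholder per frame. A small hedged sketch of that pairing, with the `Frame{n}: <image>` format assumed rather than taken from this diff:

```python
# Hedged sketch (prompt format assumed, not taken from this diff): build the
# multi-frame question that the chat calls above expect, one '<image>'
# placeholder per entry in num_patches_list so frames and prompt stay aligned.
num_patches_list = [1] * 8  # e.g. 8 sampled frames, 1 tile each
video_prefix = ''.join(f'Frame{i + 1}: <image>\n' for i in range(len(num_patches_list)))
question = video_prefix + 'Describe this video in detail.'
print(question)
```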