czczup committed
Commit 9c42ea1
1 Parent(s): 42b7cef

Update README.md

Files changed (1): README.md (+2 -17)
README.md CHANGED
@@ -115,7 +115,7 @@ Limitations: Although we have made efforts to ensure the safety of the model dur
 
 ## Quick Start
 
-We provide an example code to run InternVL2-8B using `transformers`.
+We provide an example code to run `InternVL2-8B` using `transformers`.
 
 > Please use transformers>=4.37.2 to ensure the model works normally.
 
@@ -150,21 +150,6 @@ model = AutoModel.from_pretrained(
     trust_remote_code=True).eval()
 ```
 
-#### BNB 4-bit Quantization
-
-```python
-import torch
-from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2-8B"
-model = AutoModel.from_pretrained(
-    path,
-    torch_dtype=torch.bfloat16,
-    load_in_4bit=True,
-    low_cpu_mem_usage=True,
-    use_flash_attn=True,
-    trust_remote_code=True).eval()
-```
-
 #### Multiple GPUs
 
 The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
@@ -423,7 +408,7 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
     num_patches_list=num_patches_list, history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
-question = 'Describe this video in detail. Don\'t repeat.'
+question = 'Describe this video in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
     num_patches_list=num_patches_list, history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
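
The second hunk keeps only the tail of the basic loading snippet (README line 150). For context, a minimal sketch of what that snippet presumably looks like, assuming it mirrors the removed BNB 4-bit block without `load_in_4bit`; the tokenizer line and device placement are assumptions, as they are not shown in this diff:

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL2-8B"
# Assumed to mirror the removed 4-bit snippet, minus load_in_4bit;
# only the final line of this call is visible in the hunk above.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()
# Tokenizer loading and moving the model to GPU (e.g. model.cuda()) are
# not part of this diff and are assumptions here.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```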
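
The "Multiple GPUs" paragraph kept as context in the second hunk explains why the first and last LLM layers must share a device, but the helper it refers to lies outside the hunks of this diff. A minimal sketch of that idea follows; the function name, layer count, and module names (`vision_model`, `mlp1`, `language_model.*`) are assumptions rather than lines taken from this commit:

```python
import math
import torch

def split_model(num_layers=32):
    # Spread LLM layers across GPUs, but keep the vision encoder, the
    # embeddings, the output norm/head, and the first and last LLM layers
    # on GPU 0, so related tensors end up on the same device.
    # Module names are assumptions and may not match the model exactly.
    world_size = torch.cuda.device_count()
    device_map = {}
    # GPU 0 also hosts the ViT, so give it roughly half a share of LLM layers.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(per_gpu * 0.5)
    layer = 0
    for gpu, n in enumerate(counts):
        for _ in range(n):
            if layer >= num_layers:
                break
            device_map[f'language_model.model.layers.{layer}'] = gpu
            layer += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
```

The resulting dictionary would be passed as `device_map=split_model()` to `AutoModel.from_pretrained(...)`, the standard Transformers mechanism for sharding a model across GPUs.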