update v1
- README.md +11 -9
- images/bird.jpg +0 -0
- images/evaluation.png +0 -0
- model-00001-of-00007.safetensors +1 -1
- model-00002-of-00007.safetensors +1 -1
- model-00003-of-00007.safetensors +1 -1
- model-00004-of-00007.safetensors +1 -1
- model-00005-of-00007.safetensors +1 -1
- model-00006-of-00007.safetensors +1 -1
- model-00007-of-00007.safetensors +2 -2
- model.safetensors.index.json +2 -4
- test.py +29 -0
- vision_encoder.py +1 -0
README.md
CHANGED
@@ -18,7 +18,9 @@ datasets:
 
 The Imp project aims to provide a family of strong multimodal `small` language models (MSLMs). Our `imp-v1-3b` is a strong MSLM with only **3B** parameters, built upon a small yet powerful SLM, [Phi-2](https://huggingface.co/microsoft/phi-2) (2.7B), and a powerful visual encoder, [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) (0.4B), and trained on the [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA) training set.
 
-As shown in the
+As shown in the image below, `imp-v1-3b` significantly outperforms counterparts of similar model size, and even achieves slightly better performance than the strong LLaVA-7B model on various multimodal benchmarks.
+
+![evaluation](images/evaluation.png)
 
 We release our model weights and provide an example below to run our model. A detailed technical report and the corresponding training/evaluation code will be released soon on our [GitHub repo](https://github.com/MILVLG/imp). We will continually improve our model and release new versions to further improve its performance :)
 
@@ -68,14 +70,14 @@ print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True
 
 ## Model evaluation
 We conduct evaluation on 9 commonly used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks, to compare our Imp model with LLaVA (7B) and existing MSLMs of similar model sizes.
 
-| Models | Size | VQAv2 | GQA |
-|
-| [LLaVA-v1.5-lora](https://huggingface.co/liuhaotian/llava-v1.5-7b) | 7B |79.10 |
-| [TinyGPT-V](https://huggingface.co/Tyrannosaurus/TinyGPT-V) | 3B | - | 33.60 |
-| [LLaVA-Phi](https://github.com/zhuyiche/llava-phi) | 3B | 71.40 | - |
-| [MobileVLM](https://huggingface.co/mtgv/MobileVLM-3B) | 3B | - | 59.00 |
-| [MC-LLaVA-3b](https://huggingface.co/visheratin/MC-LLaVA-3b) | 3B | 64.24 | 49.60 |
-| **Imp-v1 (ours)** | 3B | **
+| Models | Size | VQAv2 | GQA | SQA(IMG) | TextVQA | POPE | MME(P) | MMB | MM-Vet |
+|:------:|:----:|:-----:|:---:|:--------:|:-------:|:----:|:------:|:---:|:------:|
+| [LLaVA-v1.5-lora](https://huggingface.co/liuhaotian/llava-v1.5-7b) | 7B | 79.10 | 63.00 | 68.40 | 58.20 | 86.40 | 1476.9 | 66.10 | 30.2 |
+| [TinyGPT-V](https://huggingface.co/Tyrannosaurus/TinyGPT-V) | 3B | - | 33.60 | - | - | - | - | - | - |
+| [LLaVA-Phi](https://github.com/zhuyiche/llava-phi) | 3B | 71.40 | - | 68.40 | 48.60 | 85.00 | 1335.1 | 59.80 | 28.9 |
+| [MobileVLM](https://huggingface.co/mtgv/MobileVLM-3B) | 3B | - | 59.00 | 61.00 | 47.50 | 84.90 | 1288.9 | 59.60 | - |
+| [MC-LLaVA-3b](https://huggingface.co/visheratin/MC-LLaVA-3b) | 3B | 64.24 | 49.60 | - | 38.59 | 80.59 | - | - | - |
+| **Imp-v1 (ours)** | 3B | **81.42** | **64.40** | **69.26** | **59.34** | **87.85** | **1502.8** | **67.69** | **33.6** |
 
 ### Examples
 
images/bird.jpg
ADDED
images/evaluation.png
ADDED
model-00001-of-00007.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:166a9e057252c25fa569d6337d171f4ad9fa5215ca066ce2689db968b59a1aeb
 size 996428776
model-00002-of-00007.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ac8936c5f7c1992673ff7c56841430b064a0a5874821c2df9a3f6f5d7713df9d
 size 996507088
model-00003-of-00007.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:75f41aff80125fc02600c56583b630ce414c9f4c9c15002e4db2aa103110446d
 size 996512312
model-00004-of-00007.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:f0008d0798eb81eea3c915894230d3aa7854b18f492344715c10ef4b80919e85
 size 996512088
model-00005-of-00007.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:5d692152683b3597b34215209270cd9ece3bde5340883a6649e2a226674b84fc
 size 996507152
model-00006-of-00007.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:b37dd972e5111ccb1609e70215e5b9806d12dd573ab23f54f790252e715f4390
 size 1021447256
model-00007-of-00007.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:93604db012439a7fdd71718433aac7941daf7eeafcb6fb3298aee6e0a15c08c9
+size 370061024
model.safetensors.index.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "metadata": {
-    "total_size":
+    "total_size": 6373874240
   },
   "weight_map": {
     "lm_head.linear.bias": "model-00007-of-00007.safetensors",
@@ -750,8 +750,6 @@
     "transformer.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00006-of-00007.safetensors",
     "transformer.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
     "transformer.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00006-of-00007.safetensors",
-    "transformer.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
-    "transformer.vision_tower.vision_tower.vision_model.post_layernorm.bias": "model-00007-of-00007.safetensors",
-    "transformer.vision_tower.vision_tower.vision_model.post_layernorm.weight": "model-00007-of-00007.safetensors"
+    "transformer.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00006-of-00007.safetensors"
   }
 }
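For reference, the index above follows the standard sharded-safetensors layout: `metadata.total_size` plus a `weight_map` from tensor name to shard file. A minimal inspection sketch, assuming the repository has been downloaded locally to `./imp-v1-3b` (the path and printout are illustrative, not part of this commit):

```python
import json
from collections import defaultdict

# Load the sharded-checkpoint index (local path is an assumption for illustration).
with open("./imp-v1-3b/model.safetensors.index.json") as f:
    index = json.load(f)

print("declared total_size (bytes):", index["metadata"]["total_size"])

# Group tensor names by the shard file that stores them.
shards = defaultdict(list)
for tensor_name, shard_file in index["weight_map"].items():
    shards[shard_file].append(tensor_name)

for shard_file in sorted(shards):
    print(f"{shard_file}: {len(shards[shard_file])} tensors")
```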
test.py
ADDED
@@ -0,0 +1,29 @@
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from PIL import Image
+
+
+torch.set_default_device("cuda")
+
+# Create model
+model = AutoModelForCausalLM.from_pretrained(
+    "/data/ouyangxc/labs/hg/imp-v1-3b",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("/data/ouyangxc/labs/hg/imp-v1-3b", trust_remote_code=True)
+
+# Set inputs
+text = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's the color of the car? ASSISTANT:"
+image = Image.open("images/car.jpg")
+
+input_ids = tokenizer(text, return_tensors='pt').input_ids
+image_tensor = model.image_preprocess(image)
+
+# Generate the answer
+output_ids = model.generate(
+    input_ids,
+    max_new_tokens=150,
+    images=image_tensor,
+    use_cache=True)[0]
+print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
vision_encoder.py
CHANGED
@@ -549,6 +549,7 @@ class VisionTower(nn.Module):
         self.vision_tower = SiglipVisionModel(self.config)
         del self.vision_tower.vision_model.encoder.layers[(self.select_layer + 1):]
         self.vision_tower.vision_model.head = nn.Identity()
+        self.vision_tower.vision_model.post_layernorm = nn.Identity()
         self.vision_tower.requires_grad_(False)
         self.vision_tower.eval()
 
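The `post_layernorm = nn.Identity()` change above is consistent with the two `post_layernorm` entries disappearing from `model.safetensors.index.json` and with the smaller final shard: replacing a registered submodule with `nn.Identity()` removes its parameters from the module's state dict. A minimal, self-contained sketch of that effect, using a toy module rather than the actual `VisionTower`:

```python
import torch.nn as nn

# Toy stand-in for a vision-tower tail: a projection followed by a LayerNorm.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)
        self.post_layernorm = nn.LayerNorm(8)

toy = Toy()
print(sorted(toy.state_dict()))
# ['post_layernorm.bias', 'post_layernorm.weight', 'proj.bias', 'proj.weight']

# Swapping the submodule for nn.Identity() drops its parameters, so they no
# longer appear in the saved checkpoint or its weight map.
toy.post_layernorm = nn.Identity()
print(sorted(toy.state_dict()))
# ['proj.bias', 'proj.weight']
```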