Oyoy1235 committed 5f3a7bd (parent: 8526fbb): update readme

Files changed (1): README.md (+4 -4)
README.md CHANGED
@@ -10,7 +10,7 @@ datasets:
 
 ## Introduction
 
-The Imp project aims to provide a family of a strong multimodal `small` language models (MSLMs). Our `Imp-v1.5-3B-Phi2` is a strong MSLM with only **3B** parameters, which is build upon a small yet powerful SLM [Phi-2 ](https://huggingface.co/microsoft/phi-2)(2.7B) and a powerful visual encoder [SigLIP ](https://huggingface.co/google/siglip-so400m-patch14-384)(0.4B), and trained on 1M mixed dataset.
+The Imp project aims to provide a family of strong yet lightweight multimodal models. Our `Imp-v1.5-3B-Phi2` is a strong multimodal small language model (MSLM) with only **3B** parameters, built upon the small yet powerful SLM [Phi-2](https://huggingface.co/microsoft/phi-2) (2.7B) and the powerful visual encoder [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) (0.4B), and trained on a 1M-sample mixed dataset.
 
 As shown in the Table below, `Imp-v1.5-3B-Phi2` significantly outperforms the counterparts of similar model sizes, and even achieves slightly better performance than the strong LLaVA-7B model on various multimodal benchmarks.
 
@@ -26,7 +26,7 @@ pip install transformers # latest version is ok, but we recommend v4.36.0
 pip install -q pillow accelerate einops
 ```
 
-You can use the following code for model inference. The format of text instruction is similar to [LLaVA](https://github.com/haotian-liu/LLaVA). A Colab page to run this example is provided [here](https://colab.research.google.com/drive/1EBYky6xIPjnlPppo2gZaiNK6gEsjXgom?usp=drive_link#scrollTo=2-VpU6QzWCVZ). Note that the example can only be run on GPUs currently.
+You can use the following code for model inference. The format of the text instruction is similar to [LLaVA](https://github.com/haotian-liu/LLaVA). Note that the example can currently only be run on GPUs.
 
 ```Python
 import torch
@@ -69,8 +69,8 @@ We conduct evaluation on 9 commonly-used benchmarks, including 5 academic VQA be
 | [LaVA-Phi-3B](https://github.com/zhuyiche/llava-phi) | 3B | 71.4 | - | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 |-|28.9|
 | [MobileVLM-3B](https://huggingface.co/mtgv/MobileVLM-3B) | 3B | - | 59.0 | 61.0 | 47.5 | 84.9 | 1288.9 | 59.6 |- |-|
 | [MiniCPM-V-3B](https://huggingface.co/mtgv/MobileVLM-3B) | 3B | - |- | - | - | - | 1452.0 | 67.9 | **65.3**|-|
-| [Bunny-3B](https://huggingface.co/visheratin/MC-LLaVA-3b) | 3B | 79.8 | 62.5 | 70.9 | - | 86.8| 1488.8 | 68.6 |- |-|
-| **Imp-v1.5-3B-Phi2** | 3B | **81.2** | **63.5** | **72.8**| **59.8** | **88.9**| **1446.4** | **72.9**| 46.7 |**43.3**|
+| [Bunny-3B](https://huggingface.co/visheratin/MC-LLaVA-3b) | 3B | 79.8 | 62.5 | 70.9 | - | 86.8 | **1488.8** | 68.6 | - | - |
+| **Imp-v1.5-3B-Phi2** | 3B | **81.2** | **63.5** | **72.8** | **59.8** | **88.9** | 1446.4 | **72.9** | 46.7 | **43.3** |
 
 ## License
 This project is licensed under the Apache License 2.0 - see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file for details.
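
The inference snippet that the second hunk introduces is truncated by the diff context (only `import torch` is visible above). As a rough illustration of what such a call sequence looks like, the sketch below follows the `trust_remote_code` interface used by the earlier Imp-v1 model card; the repo id `MILVLG/Imp-v1.5-3B-Phi2`, the `image_preprocess` helper, the `images=` keyword to `generate`, and the prompt and image paths are assumptions here, not details confirmed by this diff.

```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

torch.set_default_device("cuda")  # the README notes the example currently requires a GPU

# Assumed repo id; the custom multimodal modeling code is pulled in via trust_remote_code.
model_id = "MILVLG/Imp-v1.5-3B-Phi2"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# LLaVA-style prompt: the <image> placeholder marks where image features are spliced in.
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: <image>\nWhat are the colors of the bus in the image? ASSISTANT:"
)
image = Image.open("images/bus.jpg")  # example image path, adjust as needed

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# image_preprocess is assumed to be provided by the repository's custom code (as in Imp-v1).
image_tensor = model.image_preprocess(image)

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True,
)[0]
# Decode only the newly generated tokens, dropping the prompt portion.
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```

The authoritative prompt template and preprocessing helpers are whatever the model repository's custom code defines; treat this sketch only as the general shape of the inference call, not as the repository's exact API.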