update model card
README.md
# Model description

`XGen-MM` is a series of foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation.

These models have been trained at scale on high-quality image caption datasets and interleaved image-text data. XGen-MM highlights the following features:

* The **pretrained** foundation model, `xgen-mm-phi3-mini-base-r-v1`, achieves state-of-the-art performance under 5b parameters and demonstrates strong in-context learning capabilities.
* The **instruct** fine-tuned model, `xgen-mm-phi3-mini-instruct-r-v1`, achieves state-of-the-art performance among open-source and closed-source VLMs under 5b parameters.
* `xgen-mm-phi3-mini-instruct-r-v1` supports flexible high-resolution image encoding with efficient visual token sampling.

More technical details will be provided in a forthcoming technical report.

| MM1-3B | 0 | 73.5 | 55.6 | 63.3 | 26.1 | 29.4 | 15.6 | 46.2 |
| | 4 | 112.3 | 99.7 | 84.1 | 48.6 | 45.3 | 38.0 | 57.9 |
| | 8 | 114.6 | 104.7 | 88.8 | 48.4 | 44.6 | 46.4 | 63.6 |
| **xgen-mm-phi3-mini-base-r-v1 (Ours)** | 0 | **81.7** | **80.2** | 60.7 | **26.5** | **36.0** | **21.2** | **48.1** |
| | 4 | 110.5 | **101.7** | **84.6** | **49.2** | **46.1** | **38.4** | **63.9** |
| | 8 | 112.1 | 104.4 | 87.7 | **49.1** | **46.4** | 44.3 | **63.8** |

| openbmb/MiniCPM-V-2 | 67.1 | 69.6 | 1808 | - | - | - | 38.2 | - | 38.7 | - | - | - | |
| VILA1.5-3B | 67.9 | 63.4 | - | 1442 | - | - | 33.3 | 35.4 | - | 69.0 | 85.9 | - | |
| xtuner/llava-phi-3-mini-hf | 70.0 | 69.2 | 1790 | 1477 | 313 | 43.7 | **41.4** | - | - | 73.7 | 87.3 | 69.3 | |
| **xgen-mm-phi3-mini-instruct-r-v1 (Ours)** | **72.1** | 74.1 | **1827** | 1467 | **360** | **44.6** | 39.8 | **45.1** | **39.3** | **74.2** | 87.2 | **75.8** | |

# How to use

Our code and weights are released under the Creative Commons Attribution Non Commercial license.

# Citation

```
@misc{xgen_mm_phi3_mini,
    title={xgen-mm-phi3-mini-instruct Model Card},
    url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
    author={Salesforce AI Research},
    month={May},
    year={2024}
}
```