Update README.md
The `GME` models support three types of input: **text**, **image**, and **image-text pair**.

**Key Enhancements of GME Models**:

- **Unified Multimodal Representation**: GME models can process both single-modal and combined-modal inputs, resulting in a unified vector representation. This enables versatile retrieval scenarios (Any2Any Search), supporting tasks such as text retrieval, image retrieval from text, and image-to-image search; a short embedding sketch follows this list.
- **High Performance**: Achieves state-of-the-art (SOTA) results on our Universal Multimodal Retrieval Benchmark (**UMRB**) and demonstrates strong evaluation scores on the Multimodal Textual Evaluation Benchmark (**MTEB**).
- **Dynamic Image Resolution**: Benefiting from `Qwen2-VL` and our training data, GME models support dynamic resolution image input.
- **Strong Visual Retrieval Performance**: Enhanced by the Qwen2-VL model series, our models excel in visual document retrieval tasks that require a nuanced understanding of document screenshots.
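
To make Any2Any search concrete, here is a minimal sketch of embedding each input type into the shared space. It assumes the `GmeQwen2VL` wrapper script distributed alongside the Hugging Face checkpoints, with `get_text_embeddings`, `get_image_embeddings`, and `get_fused_embeddings` helpers; the image URLs are placeholders.

```python
# Minimal Any2Any retrieval sketch. Assumes the GmeQwen2VL wrapper shipped
# with the checkpoint repo; the URLs below are illustrative placeholders.
from gme_inference import GmeQwen2VL  # wrapper script from the model repo

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

texts = ["What is the weather like today?", "A cat sitting on a windowsill."]
images = ["https://example.com/forecast.png", "https://example.com/cat.jpg"]

# Single-modal inputs each map to vectors in the same space.
e_text = gme.get_text_embeddings(texts=texts)
e_image = gme.get_image_embeddings(images=images)

# A combined-modal input (text + image) yields one fused vector per pair.
e_fused = gme.get_fused_embeddings(texts=texts, images=images)

# Because all modalities share one space, any-to-any scoring is a dot
# product: rows are text queries, columns are image candidates.
print(e_text @ e_image.T)
```

Since text, image, and fused vectors live in a single space, one vector index can serve text-to-text, text-to-image, and image-to-image queries alike.
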

We will extend to multi-image input, image-text interleaved data as well as mult…

We encourage and value diverse applications of GME models and continuous enhancements to the models themselves.

- If you distribute or make GME models (or any derivative works) available, or if you create a product or service (including another AI model) that incorporates them, you must prominently display `Built with GME` on your website, user interface, blog post, About page, or product documentation.
- If you utilize GME models or their outputs to develop, train, fine-tune, or improve an AI model that is distributed or made available, you must prefix the name of any such AI model with `GME`.
## Cloud API Services
In addition to the open-source [GME](https://huggingface.co/collections/Alibaba-NLP/gme-models) series, GME models are also available as commercial API services on Alibaba Cloud.

- [MultiModal Embedding Models](https://help.aliyun.com/zh/model-studio/developer-reference/general-text-embedding/): The `multimodal-embedding-v1` model service is available.
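
For reference, a call to the commercial service might look like the sketch below. It assumes the DashScope Python SDK's `MultiModalEmbedding.call` entry point and a list of `{"text": …}` / `{"image": …}` input items; this request schema is an assumption, and the linked documentation is authoritative.

```python
# Hypothetical sketch of calling the commercial multimodal-embedding-v1
# service through the DashScope SDK. The input schema here is an assumption;
# see the linked Alibaba Cloud docs for the exact request format.
import dashscope  # reads the DASHSCOPE_API_KEY environment variable

resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=[
        {"text": "A cat sitting on a windowsill."},
        {"image": "https://example.com/cat.jpg"},  # placeholder URL
    ],
)
print(resp)  # embeddings are returned in the response body on success
```
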
Note that the models behind the commercial APIs are not entirely identical to the open-source models.
## Citation
If you find our paper or models helpful, please consider citing: