- colpali
- multimodal-embedding
---

# Ops-MM-embedding-v1-2B

**Ops-MM-embedding-v1-2B** is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.

## Key Features

### Unified Multimodal Embeddings

- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval.
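Once everything lives in one embedding space, cross-modal retrieval reduces to cosine similarity between L2-normalized vectors. A minimal numerical sketch of this idea — the vectors below are random placeholders standing in for real model outputs, and mean-pooling frame embeddings into a single video vector is an illustrative assumption, not necessarily what this model's pipeline does:

```python
import numpy as np

def l2_normalize(x):
    # Unit-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Placeholder vectors standing in for model outputs: in practice each row
# would come from encoding an image or a visual document.
image_embs = l2_normalize(rng.normal(size=(4, 16)))

# A video is handled by embedding sampled frames as images; pooling them
# into one vector (mean pooling here) is an assumption for illustration.
frame_embs = l2_normalize(rng.normal(size=(8, 16)))
video_emb = l2_normalize(frame_embs.mean(axis=0, keepdims=True))

# Unified space: images, videos, and a text query are all comparable.
corpus = np.vstack([image_embs, video_emb])        # 5 candidate items
query = l2_normalize(rng.normal(size=(1, 16)))[0]  # stand-in text query

scores = corpus @ query        # cosine similarity per candidate
best = int(np.argmax(scores))  # index of the closest item
```

In a real deployment the cosine scores would feed a nearest-neighbor index rather than a brute-force matrix product, but the ranking logic is the same.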

### High Performance on MMEB

- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).

### Multilingual Capabilities

- The larger variant (**Ops-MM-embedding-v1-7B**) achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.

## Training data

MMEB-train, CC-3M, and the ColPali training set.

## Performance

### MMEB-V2

| Model                    | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| gme-Qwen2-VL-2B-Instruct | 2.21           | 54.37   | 51.89         | 33.86         | 73.47          |

### MMEB-Image

The table below compares performance on the MMEB-Image benchmark among models of similar size.

| LLaVE-2B | 1.95 | 65.2 | 62.1 | 60.2 | 65.2 | 84.9 |

### ViDoRe-v2

| Model | Avg | ESG Restaurant Human | MIT Bio | Econ. Macro | ESG Restaurant Synth. | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ----- | --- | -------------------- | ------- | ----------- | --------------------- | -------------- | ----------------- | ---------------------------- |