---
license: apache-2.0
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- mmeb
- vidore
- colpali
- multimodal-embedding
---
### Ops-MM-embedding-v1-2B

**Ops-MM-embedding-v1-2B** is a dense multimodal embedding model fine-tuned from Qwen2-VL-2B-Instruct, developed and open-sourced by the Alibaba Cloud OpenSearch-AI team.


### Key Features

#### Unified Multimodal Embeddings
- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval; see the retrieval sketch after this list and the video example under Usage.

#### High Performance on MMEB
- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).

#### Multilingual Capabilities
- The larger variant (**Ops-MM-embedding-v1-7B**) achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.
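
A minimal retrieval sketch of the unified embedding space, using the `OpsMMEmbeddingV1` wrapper from the Usage section below. It assumes the wrapper returns L2-normalized torch tensors (as the dot-product scoring in the Usage example suggests); if embeddings are not normalized, normalize them before scoring.

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1("OpenSearch-AI/Ops-MM-embedding-v1-2B", device="cuda")

# A text query against a small image corpus, scored in the shared space.
query_embeddings = model.get_text_embeddings(["A battery electric pickup truck."])
corpus = [fetch_image(url) for url in [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
]]
corpus_embeddings = model.get_image_embeddings(corpus)

# With L2-normalized embeddings, the dot product is cosine similarity.
scores = (query_embeddings @ corpus_embeddings.T).squeeze(0)
print("Ranked corpus indices:", scores.argsort(descending=True).tolist())
```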



### Training Data

MMEB-train, CC-3M, and the ColPali training set.


### Performance

#### MMEB-V2

| Model                    | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| seed-1.6-embedding       | unknown        | 71.57   | 77.78         | 55.34         | 74.41          |
| Ops-MM-embedding-v1-7B   | 8.29           | 67.79   | 72.72         | 53.76         | 70.91          |
| Ops-MM-embedding-v1-2B   | 2.21           | 63.62   | 69.03         | 47.56         | 67.55          |
| VLM2Vec-V2.0-Qwen2VL-2B  | 2.21           | 58.39   | 64.85         | 34.85         | 66.34          |
| gme-Qwen2-VL-2B-Instruct | 2.21           | 54.37   | 51.89         | 33.86         | 73.47          |



#### MMEB-Image

The table below compares performance on the MMEB-Image benchmark among models of similar size.

| Model                  | Model Size (B) | Image-Overall | I-CLS | I-QA  | I-RET | I-VG  |
| ---------------------- | -------------- | ------------- | ----- | ----- | ----- | ----- |
| Ops-MM-embedding-v1-2B | 2.21           | **69.03**     | 68.07 | 65.11 | 69.17 | 80.85 |
| B3_Qwen2_2B            | 2.21           | 68.1          | 67    | 61.19 | 70.85 | 79.88 |
| LLaVE-2B               | 1.95           | 65.2          | 62.1  | 60.2  | 65.2  | 84.9  |



#### ViDoRe-v2

| Model                  | Avg      | ESG Restaurant Human | MIT Bio | Econ. Macro | ESG Restaurant Synth. | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | -------- | -------------------- | ------- | ----------- | --------------------- | -------------- | ----------------- | ---------------------------- |
| gme-7B                 | 59.3     | 65.8                 | 64      | 62.9        | 54.3                  | 55.1           | 56.2              | 56.7                         |
| seed-1.6-embedding     | 58.9     | 63.3                 | 63.9    | 64.0        | 58.4                  | 57.1           | 53.8              | 52.0                         |
| Ops-MM-embedding-v1-7B | **60.6** | 66.3                 | 58.4    | 67.4        | 60.0                  | 54.3           | 60.9              | 56.8                         |
| Ops-MM-embedding-v1-2B | 54.4     | 58.6                 | 56.0    | 56.4        | 55.8                  | 52.9           | 47.9              | 53.4                         |




### Usage

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image


model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-2B",
    device="cuda",
    attn_implementation="flash_attention_2"
)

t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg"
]

images = [fetch_image(image) for image in images]

# Text and image embeddings (the dot product serves as the similarity score)
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text-image similarity', (text_embeddings @ image_embeddings.T).tolist())

# Fused embeddings (text and image encoded jointly, guided by an instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused-embedding similarity', (text_embeddings @ text_with_image_embeddings.T).tolist())

# Multi-image embeddings
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image similarity', (multi_image_embeddings @ multi_image_embeddings.T).tolist())

```
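
As noted under Key Features, videos are embedded by treating sampled frames as multiple image inputs, exactly like the multi-image example above. The sketch below illustrates this; the OpenCV frame sampler and the local `video.mp4` path are illustrative assumptions, not part of the model's API.

```python
import cv2  # frame sampling via OpenCV is an assumption; any video decoder works
from PIL import Image

def sample_frames(path, num_frames=8):
    """Uniformly sample frames from a video file as PIL images."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# One video -> one embedding: pass its frames as a single multi-image input.
video_embeddings = model.get_image_embeddings([sample_frames("video.mp4")])

# Text-to-video retrieval reuses the same unified embedding space.
query_embeddings = model.get_text_embeddings(["A truck driving through snow."])
print('Text-video similarity', (query_embeddings @ video_embeddings.T).tolist())
```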