---
license: apache-2.0
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- mmeb
- vidore
- colpali
- multimodal-embedding
pipeline_tag: feature-extraction
---
# Ops-MM-embedding-v1-2B

**Ops-MM-embedding-v1-2B** is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.


## Key Features

### Unified Multimodal Embeddings
- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval.
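
For example, a video can be encoded by passing its sampled frames as a single multi-image input. A minimal sketch, assuming the `OpsMMEmbeddingV1` API shown in the Usage section below; the frame files are hypothetical placeholders (extract them beforehand, e.g. with ffmpeg):

```python
# Sketch: encode a video by treating its sampled frames as one multi-image input.
# Assumes `model` and `fetch_image` are set up as in the Usage section below;
# the frame paths are hypothetical placeholders.
frame_paths = ["frame_000.jpg", "frame_010.jpg", "frame_020.jpg"]
frames = [fetch_image(p) for p in frame_paths]

# One inner list = one video: the frames are encoded together into a single embedding.
video_embeddings = model.get_image_embeddings([frames])
```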

### High Performance on MMEB
- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).

### Multilingual Capabilities
- The larger variant (**Ops-MM-embedding-v1-7B**) achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.


## Training Data

MMEB-train, CC-3M, and the ColPali training set.


## Performance

### MMEB-V2

| Model                    | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| seed-1.6-embedding       | unknown        | 71.27   | 77.78         | 55.34         | 73.44          |
| Ops-MM-embedding-v1-7B   | 8.29           | 67.61   | 72.72         | 53.76         | 70.34          |
| Ops-MM-embedding-v1-2B   | 2.21           | 63.44   | 69.03         | 47.56         | 66.96          |
| VLM2Vec-V2.0-Qwen2VL-2B  | 2.21           | 58.02   | 64.85         | 34.85         | 65.36          |
| gme-Qwen2-VL-7B-Instruct | 8.29           | 57.83   | 55.95         | 38.43         | 75.18          |
| gme-Qwen2-VL-2B-Instruct | 2.21           | 54.08   | 51.89         | 33.64         | 72.71          |


### MMEB-Image

The table below compares models of similar size on the MMEB-Image benchmark.

| Model                  | Model Size (B) | Image-Overall | I-CLS | I-QA  | I-RET | I-VG  |
| ---------------------- | -------------- | ------------- | ----- | ----- | ----- | ----- |
| Ops-MM-embedding-v1-2B | 2.21           | **69.03**     | 68.07 | 65.11 | 69.17 | 80.85 |
| B3_Qwen2_2B            | 2.21           | 68.1          | 67    | 61.19 | 70.85 | 79.88 |
| LLaVE-2B               | 1.95           | 65.2          | 62.1  | 60.2  | 65.2  | 84.9  |


### ViDoRe-v2


| Model                  | Avg       | ESG Restaurant Human | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | --------- | -------------------- | -------------- | ----------------- | ---------------------------- |
| gme-Qwen2-VL-7B-Instruct | 55.61     | 63.37                | 49.49          | 54.21             | 55.38                        |
| seed-1.6-embedding       | 56.57     | 63.3                 | 57.14          | 53.85             | 51.99                        |
| Ops-MM-embedding-v1-7B | **59.59** | 66.27                | 54.34          | 60.92             | 56.82                        |
| Ops-MM-embedding-v1-2B | 53.18     | 58.57                | 52.87          | 47.89             | 53.39                        |


## Usage

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image


model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-2B",
    device="cuda",
    attn_implementation="flash_attention_2",  # requires flash-attn; omit to fall back to the default attention
)

t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg"
]

images = [fetch_image(image) for image in images]

# Text and image embeddings
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text-to-image similarity:', (text_embeddings @ image_embeddings.T).tolist())

# Fused embeddings (text and image encoded together, guided by an instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused-to-image similarity:', (text_with_image_embeddings @ image_embeddings.T).tolist())

# Multi-image embeddings (each inner list is encoded into a single embedding)
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image similarity:', (multi_image_embeddings @ multi_image_embeddings.T).tolist())

```
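
For retrieval, the similarity matrices above can be used directly to rank candidates. A minimal sketch, assuming the returned embeddings are L2-normalized torch tensors (so the dot products are cosine similarities):

```python
# Rank the candidate images for each text query and keep the best match.
scores = text_embeddings @ image_embeddings.T  # shape: [num_texts, num_images]
best = scores.argmax(dim=1)
for i, text in enumerate(texts):
    j = best[i].item()
    print(f"query {i} ({text[:40]!r}): best image {j}, score {scores[i, j].item():.3f}")
```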