---
license: apache-2.0
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-7B-Instruct
tags:
- mmeb
- vidore
- colpali
- multimodal-embedding
pipeline_tag: feature-extraction
---
# Ops-MM-embedding-v1-7B

**Ops-MM-embedding-v1-7B** is a dense, large-scale multimodal embedding model fine-tuned from Qwen2-VL, developed and open-sourced by the Alibaba Cloud OpenSearch-AI team.


## Key Features

### Unified Multimodal Embeddings
- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval.

### High Performance on MMEB
- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).

### Multilingual Capabilities
- **Ops-MM-embedding-v1-7B** achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.



## Training data

MMEB-train, CC-3M, and the ColPali training set.


## Performance

### MMEB-V2

| Model                    | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| seed-1.6-embedding       | unknown        | 71.27   | 77.78         | 55.34         | 73.44          |
| Ops-MM-embedding-v1-7B   | 8.29           | 67.61   | 72.72         | 53.76         | 70.34          |
| Ops-MM-embedding-v1-2B   | 2.21           | 63.44   | 69.03         | 47.56         | 66.96          |
| VLM2Vec-V2.0-Qwen2VL-2B  | 2.21           | 58.02   | 64.85         | 34.85         | 65.36          |
| gme-Qwen2-VL-7B-Instruct | 8.29           | 57.83   | 55.95         | 38.43         | 75.18          |
| gme-Qwen2-VL-2B-Instruct | 2.21           | 54.08   | 51.89         | 33.64         | 72.71          |


### MMEB-Image

The table below compares performance on the MMEB-Image benchmark among models of similar size.

| Model                                 | Model Size (B) | Image-Overall | I-CLS | I-QA  | I-RET  | I-VG   |
| ------------------------------------- | ------------- | ------------- | ----- | ----- | ------ | ------ |
| Ops-MM-embedding-v1-7B                | 8.29          | **72.72**     | 69.65 | 69.58 | 73.09  | 87.15  |
| QQMM-embed                            | 8.297         | 72.175        | 70.07 | 69.52 | 71.175 | 87.075 |
| B3_Qwen2_7B                           | 8.29          | 72            | 70    | 66.5  | 74.1   | 84.6   |
| UniME(LLaVA-OneVision-7B-LoRA-Res336) | 8.03          | 70.7          | 66.8  | 66.6  | 70.5   | 90.9   |
| LLaVE-7B                              | 8.03          | 70.3          | 65.7  | 65.4  | 70.9   | 91.9   |
| UNITE-Instruct-7B                     | 8.29          | 70.3          | 68.3  | 65.1  | 71.6   | 84.8   |


### ViDoRe-v2

| Model                  | Avg       | ESG Restaurant Human | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | --------- | -------------------- | -------------- | ----------------- | ---------------------------- |
| gme-7B                 | 55.61     | 63.37                | 49.49          | 54.21             | 55.38                        |
| seed 1.6 embedding     | 56.57     | 63.3                 | 57.14          | 53.85             | 51.99                        |
| Ops-MM-embedding-v1-7B | **59.59** | 66.27                | 54.34          | 60.92             | 56.82                        |
| Ops-MM-embedding-v1-2B | 53.18     | 58.57                | 52.87          | 47.89             | 53.39                        |



## Usage

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image


model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-7B",
    device="cuda",
    attn_implementation="flash_attention_2"
)

t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg"
]

images = [fetch_image(image) for image in images]

# Text and image embedding
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text-image similarities', (text_embeddings @ image_embeddings.T).tolist())

# Fused text + image embeddings (with a retrieval instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused-embedding similarities', (text_with_image_embeddings @ text_with_image_embeddings.T).tolist())

# Multi-image embeddings
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image similarities', (multi_image_embeddings @ multi_image_embeddings.T).tolist())
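
# Hypothetical retrieval sketch (not part of the original example): rank the
# candidate images for each text query by similarity score. Whether these
# scores are cosine similarities depends on whether the model L2-normalizes
# its outputs; higher is better either way.
scores = text_embeddings @ image_embeddings.T
best_image_per_text = scores.argmax(-1)
print('Best-matching image index per text', best_image_per_text.tolist())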

```
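
As noted under Key Features, videos are handled by treating sampled frames as multiple image inputs. The sketch below is a hedged example of that pattern: it reuses the `get_image_embeddings` multi-image interface shown above, while the frame paths, sampling strategy, and query text are hypothetical (frame extraction itself is left to the caller).

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1("OpenSearch-AI/Ops-MM-embedding-v1-7B", device="cuda")

# Hypothetical pre-extracted video frames (local paths or URLs); how many
# frames to sample and how to extract them is up to the caller.
frame_paths = [
    "frames/video1_frame_000.jpg",
    "frames/video1_frame_010.jpg",
    "frames/video1_frame_020.jpg",
]
frames = [fetch_image(p) for p in frame_paths]

# A video is passed as a single multi-image input (a list of frames), yielding
# one embedding, exactly like the multi-image example above.
video_embeddings = model.get_image_embeddings([frames])

# Text query embedded into the same space for text-to-video retrieval.
query_embeddings = model.get_text_embeddings(["A person rides a bicycle along the beach."])
print('Text-video similarities', (query_embeddings @ video_embeddings.T).tolist())
```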