---
license: apache-2.0
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-7B-Instruct
tags:
- mmeb
- vidore
- colpali
- multimodal-embedding
pipeline_tag: feature-extraction
---
# Ops-MM-embedding-v1-7B
**Ops-MM-embedding-v1-7B** is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.
## **Key Features**
### Unified Multimodal Embeddings
- Encodes text, images, text-image pairs, visual documents, and videos (by treating sampled video frames as multiple image inputs; see the video sketch at the end of the Usage section) into a unified embedding space for cross-modal retrieval.
### High Performance on MMEB
- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).
### Multilingual Capabilities
- **Ops-MM-embedding-v1-7B** achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.
## Training data
MMEB-train, CC-3M, and the ColPali training set.
## Performance
### MMEB-V2
| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| seed-1.6-embedding | unknown | 71.27 | 77.78 | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |
### MMEB-Image
The table below compares performance on the MMEB-Image benchmark among models of similar size.
| Models | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ------------------------------------- | ------------- | ------------- | ----- | ----- | ------ | ------ |
| Ops-MM-embedding-v1-7B | 8.29 | **72.72** | 69.65 | 69.58 | 73.09 | 87.15 |
| QQMM-embed | 8.297 | 72.175 | 70.07 | 69.52 | 71.175 | 87.075 |
| B3_Qwen2_7B | 8.29 | 72 | 70 | 66.5 | 74.1 | 84.6 |
| UniME(LLaVA-OneVision-7B-LoRA-Res336) | 8.03 | 70.7 | 66.8 | 66.6 | 70.5 | 90.9 |
| LLaVE-7B | 8.03 | 70.3 | 65.7 | 65.4 | 70.9 | 91.9 |
| UNITE-Instruct-7B | 8.29 | 70.3 | 68.3 | 65.1 | 71.6 | 84.8 |
### ViDoRe-v2
| Model | Avg | ESG Restaurant Human | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | --------- | -------------------- | -------------- | ----------------- | ---------------------------- |
| gme-7B | 55.61 | 63.37 | 49.49 | 54.21 | 55.38 |
| seed-1.6-embedding | 56.57 | 63.3 | 57.14 | 53.85 | 51.99 |
| Ops-MM-embedding-v1-7B | **59.59** | 66.27 | 54.34 | 60.92 | 56.82 |
| Ops-MM-embedding-v1-2B | 53.18 | 58.57 | 52.87 | 47.89 | 53.39 |
## Usage
```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image
# Load the model; flash_attention_2 is optional and requires the flash-attn package
model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-7B",
    device="cuda",
    attn_implementation="flash_attention_2",
)
t2i_prompt = "Find an image that matches the given text."
texts = [
"The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
"Alibaba office.",
"Alibaba office.",
]
images = [
"https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
"https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
"https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg"
]
images = [fetch_image(image) for image in images]
# Text and image embeddings
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text-to-image similarity', (text_embeddings @ image_embeddings.T).tolist())
# Fused embeddings (text and image encoded together, guided by the instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused-to-image similarity', (text_with_image_embeddings @ image_embeddings.T).tolist())
# Multi-image embeddings: each inner list (e.g. video frames or document pages) is encoded into a single embedding
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image similarity', (multi_image_embeddings @ multi_image_embeddings.T).tolist())
```
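The Key Features section notes that videos are handled by treating sampled frames as multiple image inputs. The sketch below illustrates that pattern under stated assumptions: frame extraction is not part of this package, so two still images stand in for sampled frames, and the text query is encoded with the same `get_text_embeddings` call shown above.

```python
# Hedged sketch: encode a "video" by passing its sampled frames as one
# multi-image input, then score it against a text query.
# Assumption: frame sampling happens outside the model (e.g. decord, OpenCV,
# or pre-extracted JPEG frames); two still images stand in for frames here.
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-7B",
    device="cuda",
    attn_implementation="flash_attention_2",
)

frame_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg",
]
frames = [fetch_image(url) for url in frame_urls]

# One inner list == one video clip; get_image_embeddings returns one vector per list
video_embeddings = model.get_image_embeddings([frames])

# Score a text query against the clip embedding
query_embeddings = model.get_text_embeddings(["An office campus of Alibaba."])
print("Text-to-video similarity", (query_embeddings @ video_embeddings.T).tolist())
```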