---
license: apache-2.0
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- mmeb
- vidore
- colpali
- multimodal-embedding
pipeline_tag: feature-extraction
---
# Ops-MM-embedding-v1-2B
**Ops-MM-embedding-v1-2B** is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.
## **Key Features**
### Unified Multimodal Embeddings
- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval.
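The video case reuses the multi-image code path shown in the Usage section below: sample frames from a clip and pass them as a single multi-image input. A minimal sketch (frame sampling via OpenCV and the local file path are illustrative choices, not something this repository prescribes):

```python
import cv2
from PIL import Image
from ops_mm_embedding_v1 import OpsMMEmbeddingV1

model = OpsMMEmbeddingV1("OpenSearch-AI/Ops-MM-embedding-v1-2B", device="cuda")

# Decode a clip and keep ~8 evenly spaced frames as PIL images.
cap = cv2.VideoCapture("clip.mp4")  # illustrative path
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
cap.release()
frames = frames[:: max(1, len(frames) // 8)][:8]

# One multi-image input -> one video embedding.
video_embeddings = model.get_image_embeddings([frames])
```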
### High Performance on MMEB
- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).
### Multilingual Capabilities
- The larger variant (**Ops-MM-embedding-v1-7B**) achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.
## Training Data
MMEB-train, CC-3M, and the ColPali training set.
## Performance
### MMEB-V2
| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| seed-1.6-embedding | unknown | 71.27 | 77.78 | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |
### MMEB-Image
The table below compares performance on the MMEB-Image benchmark among models of similar size.
| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ---------------------- | -------------- | ------------- | ----- | ----- | ----- | ----- |
| Ops-MM-embedding-v1-2B | 2.21 | **69.03** | 68.07 | 65.11 | 69.17 | 80.85 |
| B3_Qwen2_2B | 2.21 | 68.1 | 67 | 61.19 | 70.85 | 79.88 |
| LLaVE-2B | 1.95 | 65.2 | 62.1 | 60.2 | 65.2 | 84.9 |
### ViDoRe-v2
| Model | Avg | ESG Restaurant Human | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | --------- | -------------------- | -------------- | ----------------- | ---------------------------- |
| gme-7B | 55.61 | 63.37 | 49.49 | 54.21 | 55.38 |
| seed 1.6 embedding | 56.57 | 63.3 | 57.14 | 53.85 | 51.99 |
| Ops-MM-embedding-v1-7B | **59.59** | 66.27 | 54.34 | 60.92 | 56.82 |
| Ops-MM-embedding-v1-2B | 53.18 | 58.57 | 52.87 | 47.89 | 53.39 |
## Usage
```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image
model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-2B",
    device="cuda",
    attn_implementation="flash_attention_2",
)
t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg",
]
images = [fetch_image(image) for image in images]
# Text and image embedding
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text and image embeddings', (text_embeddings @ image_embeddings.T).tolist())
# Fused embeddings (text and image encoded together, guided by the instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused embeddings', (text_with_image_embeddings @ image_embeddings.T).tolist())
# Multi-image embeddings
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image embeddings', (multi_image_embeddings @ multi_image_embeddings.T).tolist())
```
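As a small follow-up to the snippet above (not part of the packaged API, and assuming the returned embeddings support `argmax` like torch tensors or NumPy arrays, as the matrix products above suggest), the text-to-image similarity matrix can be turned into a retrieval result directly:

```python
# Continues from the Usage snippet: pick the highest-scoring image per text query.
scores = text_embeddings @ image_embeddings.T
best = scores.argmax(-1)  # index of the best-matching image for each text
for text, idx in zip(texts, best.tolist()):
    print(f"{text[:50]!r} -> image {idx}")
```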