|
---
license: apache-2.0
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- mmeb
- vidore
- colpali
- multimodal-embedding
---
|
### Ops-MM-embedding-v1-2B |
|
|
|
**Ops-MM-embedding-v1-2B** is a dense, large-scale multimodal embedding model fine-tuned from Qwen2-VL, developed and open-sourced by the Alibaba Cloud OpenSearch-AI team.
|
|
|
|
|
### **Key Features** |
|
|
|
#### Unified Multimodal Embeddings |
|
- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval; a video example is sketched below.
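
As a minimal sketch of the video case, sampled frames from a clip can be passed as a single multi-image input to obtain one embedding for the whole video. The frame file names below are hypothetical placeholders; the snippet assumes the `OpsMMEmbeddingV1` wrapper from this repository (see the Usage section) and that `fetch_image` accepts local paths as well as URLs.

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1("OpenSearch-AI/Ops-MM-embedding-v1-2B", device="cuda")

# Pre-extracted video frames (placeholder file names) encoded as one multi-image input.
frames = [fetch_image(f"frame_{i:03d}.jpg") for i in range(4)]
video_embedding = model.get_image_embeddings([frames])  # one vector for the whole clip

# Score the clip against a text query in the shared embedding space.
query_embedding = model.get_text_embeddings(["A pickup truck driving through snow."])
print((query_embedding @ video_embedding.T).tolist())
```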
|
|
|
#### High Performance on MMEB |
|
- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).
|
|
|
#### Multilingual Capabilities |
|
- The larger variant (**Ops-MM-embedding-v1-7B**) achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization. |
|
|
|
|
|
|
|
### Training data |
|
|
|
MMEB-train, CC-3M, and the ColPali training set.
|
|
|
|
|
### Performance |
|
|
|
#### MMEB-V2 |
|
|
|
| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| seed-1.6-embedding | unknown | 71.57 | 77.78 | 55.34 | 74.41 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.79 | 72.72 | 53.76 | 70.91 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.62 | 69.03 | 47.56 | 67.55 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.39 | 64.85 | 34.85 | 66.34 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.37 | 51.89 | 33.86 | 73.47 |
|
|
|
|
|
|
|
#### MMEB-Image |
|
|
|
The table below compares performance on the MMEB-Image benchmark among models of similar size.
|
|
|
| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ---------------------- | -------------- | ------------- | ----- | ----- | ----- | ----- |
| Ops-MM-embedding-v1-2B | 2.21 | **69.03** | 68.07 | 65.11 | 69.17 | 80.85 |
| B3_Qwen2_2B | 2.21 | 68.1 | 67 | 61.19 | 70.85 | 79.88 |
| LLaVE-2B | 1.95 | 65.2 | 62.1 | 60.2 | 65.2 | 84.9 |
|
|
|
|
|
|
|
#### ViDoRe-v2 |
|
|
|
| Model | Avg | ESG Restaurant Human | MIT Bio | Econ. Macro | ESG Restaurant Synth. | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | -------- | -------------------- | ------- | ----------- | --------------------- | -------------- | ----------------- | ---------------------------- |
| gme-7B | 59.3 | 65.8 | 64 | 62.9 | 54.3 | 55.1 | 56.2 | 56.7 |
| seed-1.6-embedding | 58.9 | 63.3 | 63.9 | 64.0 | 58.4 | 57.1 | 53.8 | 52.0 |
| Ops-MM-embedding-v1-7B | **60.6** | 66.3 | 58.4 | 67.4 | 60.0 | 54.3 | 60.9 | 56.8 |
| Ops-MM-embedding-v1-2B | 54.4 | 58.6 | 56.0 | 56.4 | 55.8 | 52.9 | 47.9 | 53.4 |
|
|
|
|
|
|
|
|
|
### Usage
|
|
|
```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-2B",
    device="cuda",
    attn_implementation="flash_attention_2",
)

t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg",
]
images = [fetch_image(image) for image in images]

# Text and image embeddings
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text and image embeddings', (text_embeddings @ image_embeddings.T).tolist())

# Fused embeddings (text + image pairs encoded together, guided by an instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused embeddings', (text_with_image_embeddings @ image_embeddings.T).tolist())

# Multi-image embeddings (each inner list of images is encoded into a single vector)
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image embeddings', (multi_image_embeddings @ multi_image_embeddings.T).tolist())
```
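
As a quick follow-up, the text-image similarity matrix from the snippet above can be turned into a simple retrieval result. This sketch reuses `texts`, `text_embeddings`, and `image_embeddings` from that snippet and assumes the embeddings are returned as torch tensors (consistent with the `@` / `.T` operations used above).

```python
# Rank the candidate images for each text query using the similarity scores.
scores = text_embeddings @ image_embeddings.T  # shape: (num_texts, num_images)
best = scores.argmax(dim=-1)                   # index of the best-matching image per text
for text, idx in zip(texts, best.tolist()):
    print(f"{text!r} -> image #{idx}")
```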