---
license: apache-2.0
language:
  - multilingual
base_model:
  - Qwen/Qwen2-VL-2B-Instruct
tags:
  - mmeb
  - vidore
  - colpali
  - multimodal-embedding
---

# Ops-MM-embedding-v1-2B

Ops-MM-embedding-v1-2B is a dense, large-scale multimodal embedding model developed and open-sourced by the Alibaba Cloud OpenSearch-AI team, fine-tuned from Qwen2-VL.

## Key Features

**Unified Multimodal Embeddings**

- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval (a minimal sketch of the video case follows this list).

**High Performance on MMEB**

- Achieves state-of-the-art results among models of similar scale on the MMEB-V2 and MMEB-Image benchmarks (as of 2025-07-03).

**Multilingual Capabilities**

- The larger variant, Ops-MM-embedding-v1-7B, achieves state-of-the-art performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization.
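
The video case from the first feature above can be handled by sampling frames and passing them as one multi-image input. A minimal sketch, assuming pre-extracted frame files and the `OpsMMEmbeddingV1` wrapper shown in the Usage section below (the frame filenames and query text are illustrative only):

```python
from PIL import Image
from ops_mm_embedding_v1 import OpsMMEmbeddingV1

# Wrapper from the Usage section; the flash-attention argument is omitted for brevity.
model = OpsMMEmbeddingV1("OpenSearch-AI/Ops-MM-embedding-v1-2B", device="cuda")

# Hypothetical frames sampled from a single video clip.
frames = [Image.open(p) for p in ["frame_000.jpg", "frame_016.jpg", "frame_032.jpg"]]

# The frames are encoded together as one multi-image input, yielding a single
# embedding for the clip (same pattern as the multi-image example in Usage).
video_embedding = model.get_image_embeddings([frames])

# A text query lands in the same embedding space, so cross-modal retrieval is
# just a similarity score between the two embeddings.
query_embedding = model.get_text_embeddings(["A person assembling furniture."])
print((query_embedding @ video_embedding.T).tolist())
```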

## Training Data

MMEB-train, CC-3M, and the ColPali training set.

## Performance

### MMEB-V2

| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
|---|---|---|---|---|---|
| seed-1.6-embedding | unknown | 71.57 | 77.78 | 55.34 | 74.41 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.79 | 72.72 | 53.76 | 70.91 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.62 | 69.03 | 47.56 | 67.55 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.39 | 64.85 | 34.85 | 66.34 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.37 | 51.89 | 33.86 | 73.47 |

### MMEB-Image

The table below compares performance on the MMEB-Image benchmark among models of similar size.

| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
|---|---|---|---|---|---|---|
| Ops-MM-embedding-v1-2B | 2.21 | 69.03 | 68.07 | 65.11 | 69.17 | 80.85 |
| B3_Qwen2_2B | 2.21 | 68.1 | 67 | 61.19 | 70.85 | 79.88 |
| LLaVE-2B | 1.95 | 65.2 | 62.1 | 60.2 | 65.2 | 84.9 |

### ViDoRe-v2

| Model | Avg | ESG Restaurant Human | MIT Bio | Econ. Macro | ESG Restaurant Synth. | MIT Bio Multi. | Econ. Macro Multi. | ESG Restaurant Synth. Multi. |
|---|---|---|---|---|---|---|---|---|
| gme-7B | 59.3 | 65.8 | 64 | 62.9 | 54.3 | 55.1 | 56.2 | 56.7 |
| seed-1.6-embedding | 58.9 | 63.3 | 63.9 | 64.0 | 58.4 | 57.1 | 53.8 | 52.0 |
| Ops-MM-embedding-v1-7B | 60.6 | 66.3 | 58.4 | 67.4 | 60.0 | 54.3 | 60.9 | 56.8 |
| Ops-MM-embedding-v1-2B | 54.4 | 58.6 | 56.0 | 56.4 | 55.8 | 52.9 | 47.9 | 53.4 |

## Usage

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-2B",
    device="cuda",
    attn_implementation="flash_attention_2",
)

t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg",
]

# Fetch the images so they can be passed to the model.
images = [fetch_image(image) for image in images]

# Text and image embeddings
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text-image similarities', (text_embeddings @ image_embeddings.T).tolist())

# Fused embeddings (text and image encoded together, guided by an instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused-image similarities', (text_with_image_embeddings @ image_embeddings.T).tolist())

# Multi-image embeddings: each inner list is encoded as a single input
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image similarities', (multi_image_embeddings @ multi_image_embeddings.T).tolist())
```
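
To turn the printed similarity matrices into an actual retrieval result, rank the candidate images per text query. A minimal sketch, assuming the embeddings returned above are torch tensors (as the `@` and `.tolist()` usage suggests):

```python
import torch

# Rank candidate images for each text query by similarity score.
scores = text_embeddings @ image_embeddings.T  # shape: [num_texts, num_images]
best = torch.argmax(scores, dim=-1)            # best-matching image index per query
for i, image_idx in enumerate(best.tolist()):
    print(f"query {i} -> image {image_idx} (score {scores[i, image_idx].item():.4f})")
```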