|
---
license: apache-2.0
language:
- multilingual
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- mmeb
- vidore
- colpali
- multimodal-embedding
---
|
### Ops-MM-embedding-v1-2B |
|
|
|
**Ops-MM-embedding-v1-2B** is a dense, large-scale multimodal embedding model fine-tuned from Qwen2-VL, developed and open-sourced by the Alibaba Cloud OpenSearch-AI team.
|
|
|
|
|
### **Key Features** |
|
|
|
#### Unified Multimodal Embeddings |
|
- Encodes text, images, text-image pairs, visual documents, and videos (by treating video frames as multiple image inputs) into a unified embedding space for cross-modal retrieval; a video example is sketched below.
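
As a minimal sketch of the video case, sampled frames from a clip can be passed as a single multi-image input to obtain one embedding for the whole video. The frame file names below are hypothetical placeholders; the snippet assumes the `OpsMMEmbeddingV1` wrapper from this repository (see the Usage section) and that `fetch_image` accepts local paths as well as URLs.

```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1("OpenSearch-AI/Ops-MM-embedding-v1-2B", device="cuda")

# Pre-extracted video frames (placeholder file names) encoded as one multi-image input.
frames = [fetch_image(f"frame_{i:03d}.jpg") for i in range(4)]
video_embedding = model.get_image_embeddings([frames])  # one vector for the whole clip

# Score the clip against a text query in the shared embedding space.
query_embedding = model.get_text_embeddings(["A pickup truck driving through snow."])
print((query_embedding @ video_embedding.T).tolist())
```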
|
|
|
#### High Performance on MMEB |
|
- Achieves **SOTA results** among models of similar scale on the **MMEB-V2** and **MMEB-Image** benchmarks (as of 2025-07-03).
|
|
|
#### Multilingual Capabilities |
|
- The larger variant (**Ops-MM-embedding-v1-7B**) achieves SOTA performance among dense models on the ViDoRe-v2 benchmark, demonstrating strong cross-lingual generalization. |
|
|
|
|
|
|
|
### Training data |
|
|
|
MMEB-train, CC-3M, and the ColPali training set.
|
|
|
|
|
### Performance |
|
|
|
#### MMEB-V2 |
|
|
|
| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| ------------------------ | -------------- | ------- | ------------- | ------------- | -------------- |
| seed-1.6-embedding | unknown | 71.57 | 77.78 | 55.34 | 74.41 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.79 | 72.72 | 53.76 | 70.91 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.62 | 69.03 | 47.56 | 67.55 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.39 | 64.85 | 34.85 | 66.34 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.37 | 51.89 | 33.86 | 73.47 |
|
|
|
|
|
|
|
#### MMEB-Image |
|
|
|
The table below compares performance on the MMEB-Image benchmark among models of similar size.
|
|
|
| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| ---------------------- | -------------- | ------------- | ----- | ----- | ----- | ----- |
| Ops-MM-embedding-v1-2B | 2.21 | **69.03** | 68.07 | 65.11 | 69.17 | 80.85 |
| B3_Qwen2_2B | 2.21 | 68.1 | 67 | 61.19 | 70.85 | 79.88 |
| LLaVE-2B | 1.95 | 65.2 | 62.1 | 60.2 | 65.2 | 84.9 |
|
|
|
|
|
|
|
#### ViDoRe-v2 |
|
|
|
| Model | Avg | ESG Restaurant Human | MIT Bio | Econ. Macro | ESG Restaurant Synth. | MIT Bio Multi. | Econ Macro Multi. | ESG Restaurant Synth. Multi. |
| ---------------------- | -------- | -------------------- | ------- | ----------- | --------------------- | -------------- | ----------------- | ---------------------------- |
| gme-7B | 59.3 | 65.8 | 64 | 62.9 | 54.3 | 55.1 | 56.2 | 56.7 |
| seed-1.6-embedding | 58.9 | 63.3 | 63.9 | 64.0 | 58.4 | 57.1 | 53.8 | 52.0 |
| Ops-MM-embedding-v1-7B | **60.6** | 66.3 | 58.4 | 67.4 | 60.0 | 54.3 | 60.9 | 56.8 |
| Ops-MM-embedding-v1-2B | 54.4 | 58.6 | 56.0 | 56.4 | 55.8 | 52.9 | 47.9 | 53.4 |
|
|
|
|
|
|
|
|
|
### Usage
|
|
|
```python
from ops_mm_embedding_v1 import OpsMMEmbeddingV1, fetch_image

model = OpsMMEmbeddingV1(
    "OpenSearch-AI/Ops-MM-embedding-v1-2B",
    device="cuda",
    attn_implementation="flash_attention_2",
)

t2i_prompt = "Find an image that matches the given text."
texts = [
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023.",
    "Alibaba office.",
    "Alibaba office.",
]
images = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e9/Tesla_Cybertruck_damaged_window.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/e/e0/TaobaoCity_Alibaba_Xixi_Park.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Alibaba_Binjiang_Park.jpg/1024px-Alibaba_Binjiang_Park.jpg",
]
images = [fetch_image(image) for image in images]

# Text and image embeddings
text_embeddings = model.get_text_embeddings(texts)
image_embeddings = model.get_image_embeddings(images)
print('Text and image embeddings', (text_embeddings @ image_embeddings.T).tolist())

# Fused embeddings (text + image pairs encoded together, guided by an instruction)
text_with_image_embeddings = model.get_fused_embeddings(texts=texts, images=images, instruction=t2i_prompt)
print('Fused embeddings', (text_with_image_embeddings @ image_embeddings.T).tolist())

# Multi-image embeddings (each inner list of images is encoded into a single vector)
multi_images = [
    [images[0]],
    [images[1], images[2]],
]
multi_image_embeddings = model.get_image_embeddings(multi_images)
print('Multi-image embeddings', (multi_image_embeddings @ multi_image_embeddings.T).tolist())
```
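
As a quick follow-up, the text-image similarity matrix from the snippet above can be turned into a simple retrieval result. This sketch reuses `texts`, `text_embeddings`, and `image_embeddings` from that snippet and assumes the embeddings are returned as torch tensors (consistent with the `@` / `.T` operations used above).

```python
# Rank the candidate images for each text query using the similarity scores.
scores = text_embeddings @ image_embeddings.T  # shape: (num_texts, num_images)
best = scores.argmax(dim=-1)                   # index of the best-matching image per text
for text, idx in zip(texts, best.tolist()):
    print(f"{text!r} -> image #{idx}")
```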