---
license: apache-2.0
tags:
- vidore
- reranker
- qwen2_vl
datasets:
- vidore/colpali_train_set
base_model:
- Qwen/Qwen2-VL-2B-Instruct
---
# MonoQwen2-VL-v0.1

## Model Overview
**MonoQwen2-VL-v0.1** is a multimodal reranker fine-tuned with LoRA from [Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) and optimized for pointwise image-query relevance assessment using the [MonoT5](https://arxiv.org/pdf/2101.05667) objective.
Given an image and a query placed in the VLM prompt, the model is trained to generate "True" if the image is relevant to the query and "False" otherwise.
During inference, a relevance score is obtained by comparing the logits of these two tokens. This score can then be used to rerank the candidates produced by a first-stage retriever (such as DSE or ColPali), or to filter them with a threshold.

The model was trained on the [ColPali train set](https://huggingface.co/datasets/vidore/colpali_train_set), with negatives mined using DSE.

## How to Use the Model
Below is a quick example that scores a single image against a user query with this model:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load processor and model
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "lightonai/MonoQwen2-VL-v0.1",
    device_map="auto",
    # attn_implementation="flash_attention_2",
    # torch_dtype=torch.bfloat16,
)

# Define query and load image
query = "What is ColPali?"
image_path = "your/path/to/image.png"
image = Image.open(image_path)

# Construct the prompt and prepare input
prompt = (
    "Assert the relevance of the previous image document to the following query, "
    "answer True or False. The query is: {query}"
).format(query=query)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply chat template and tokenize
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to("cuda")

# Run inference to obtain logits
with torch.no_grad():
    outputs = model(**inputs)
    logits_for_last_token = outputs.logits[:, -1, :]

# Convert tokens and calculate relevance score
true_token_id = processor.tokenizer.convert_tokens_to_ids("True")
false_token_id = processor.tokenizer.convert_tokens_to_ids("False")
relevance_score = torch.softmax(logits_for_last_token[:, [true_token_id, false_token_id]], dim=-1)

# Extract and display probabilities
true_prob = relevance_score[0, 0].item()
false_prob = relevance_score[0, 1].item()

print(f"True probability: {true_prob:.4f}, False probability: {false_prob:.4f}")
```

This example demonstrates how to use the model to assess the relevance of an image with respect to a query. It outputs the probability that the image is relevant ("True") or not relevant ("False").
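
As mentioned above, this per-pair score can be used to rerank the candidates returned by a first-stage retriever, or to filter them with a threshold. Below is a minimal sketch reusing the `model`, `processor`, and `query` defined above; the `relevance` helper, the candidate file names, and the 0.5 threshold are illustrative and not part of the model's API:

```python
def relevance(query: str, image: Image.Image) -> float:
    """Return the probability of the "True" token for a query-image pair."""
    prompt = (
        "Assert the relevance of the previous image document to the following query, "
        "answer True or False. The query is: {query}"
    ).format(query=query)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    true_id = processor.tokenizer.convert_tokens_to_ids("True")
    false_id = processor.tokenizer.convert_tokens_to_ids("False")
    return torch.softmax(logits[:, [true_id, false_id]], dim=-1)[0, 0].item()

# Hypothetical candidate pages returned by a first-stage retriever (e.g. DSE or ColPali)
candidate_paths = ["page_1.png", "page_2.png", "page_3.png"]
scored = [(path, relevance(query, Image.open(path))) for path in candidate_paths]

# Rerank by decreasing relevance; optionally keep only candidates above a threshold
reranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
filtered = [(path, score) for path, score in reranked if score >= 0.5]
print(reranked)
```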

**Note**: this example requires `peft` to be installed in your environment (`pip install peft`), since the checkpoint is a LoRA adapter. Alternatively, you can attach the adapter to the original Qwen2-VL-2B model with [`load_adapter`](https://huggingface.co/docs/transformers/peft#transformers.integrations.PeftAdapterMixin.load_adapter).
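
A minimal sketch of that alternative, assuming the adapter weights can be loaded directly from the `lightonai/MonoQwen2-VL-v0.1` repository (this route also relies on the `peft` integration under the hood):

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the base model, then attach the MonoQwen2-VL-v0.1 LoRA weights to it.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
)
base_model.load_adapter("lightonai/MonoQwen2-VL-v0.1")
# `base_model` can now be used exactly like `model` in the example above.
```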

## Performance Metrics

The model has been evaluated on the [ViDoRe Benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard) by retrieving the top 10 candidates with [MrLight_dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and reranking them with MonoQwen2-VL-v0.1. The table below reports the resulting `ndcg@5` scores:

| Dataset                                           | [MrLight_dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1)  | MonoQwen2-VL-v0.1 reranking |
|---------------------------------------------------|--------------------------|------------------------|
| vidore/arxivqa_test_subsampled                    | 85.6                     | 89.0                   |
| vidore/docvqa_test_subsampled                     | 57.1                     | 59.7                   |
| vidore/infovqa_test_subsampled                    | 88.1                     | 93.2                   |
| vidore/tabfquad_test_subsampled                   | 93.1                     | 96.0                   |
| vidore/shiftproject_test                          | 82.0                     | 93.0                   |
| vidore/syntheticDocQA_artificial_intelligence_test| 97.5                     | 100.0                  |
| vidore/syntheticDocQA_energy_test                 | 92.9                     | 97.7                   |
| vidore/syntheticDocQA_government_reports_test     | 96.0                     | 98.0                   |
| vidore/syntheticDocQA_healthcare_industry_test    | 96.4                     | 99.3                   |
| vidore/tatdqa_test                                | 69.4                     | 79.0                   |
| **Mean**                                          | 85.8                     | 90.5                   |


## License

This LoRA model is licensed under the Apache 2.0 license.