---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-to-text
---

# Qwen2-VL-7B-Captioner-Relaxed

## Introduction

Qwen2-VL-7B-Captioner-Relaxed is an instruction-tuned version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. It was fine-tuned on a hand-curated dataset built for text-to-image models and produces significantly more detailed descriptions of a given image.

### Key Features:

* **Enhanced Detail:** Generates more comprehensive and nuanced image descriptions.
* **Relaxed Constraints:** Offers less restrictive image descriptions compared to the base model.
* **Natural Language Output:** Describes different subjects in the image while specifying their locations using natural language.
* **Optimized for Image Generation:** Produces captions in formats compatible with state-of-the-art text-to-image generation models.

**Note:** This fine-tuned model is optimized for creating text-to-image datasets. As a result, performance on other tasks may be lower than the original model's (e.g., a ~10% decrease on mmmu_val).

## Requirements

If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, try installing the latest version of the transformers library from source:

`pip install git+https://github.com/huggingface/transformers`
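
As a quick sanity check, you can verify that your installed `transformers` build already includes the Qwen2-VL classes (a minimal sketch; no specific minimum version is pinned here):

```python
import transformers

print(transformers.__version__)

# If this import raises ImportError, the installed transformers predates
# Qwen2-VL support; install from source as shown above.
from transformers import Qwen2VLForConditionalGeneration
```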

## Quickstart
```python
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch

model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

image = Image.open(r"PATH_TO_YOUR_IMAGE")

# Resize the image here if it does not fit in VRAM,
# or set the model's maximum image size instead.
# image = image.resize((1024, 1024))

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output_ids = model.generate(
            **inputs,
            max_new_tokens=384,
            do_sample=True,
            temperature=0.7,
            use_cache=True,
            top_k=50,
        )

# Trim the prompt tokens so only the newly generated caption is decoded.
generated_ids = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0]
print(output_text)
```
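
If the bfloat16 weights do not fit in your GPU memory, loading the model in 4-bit is one option. Here is a minimal sketch using `BitsAndBytesConfig` (this assumes the `bitsandbytes` package is installed; 4-bit quantization may slightly affect caption quality):

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# NF4 quantization with bfloat16 compute, a common memory-saving setup.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed",
    quantization_config=quant_config,
    device_map="auto",
)
```

The rest of the quickstart works unchanged with the quantized model.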

For more detailed options, refer to the [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) documentation.