---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-to-text
---
# Qwen2-VL-7B-Captioner-Relaxed
## Introduction
Qwen2-VL-7B-Captioner-Relaxed is an instruction-tuned version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. It was fine-tuned on a hand-curated dataset built for text-to-image models and produces significantly more detailed descriptions of the images it is given.
### Key Features:
* **Enhanced Detail:** Generates more comprehensive and nuanced image descriptions.
* **Relaxed Constraints:** Offers less restrictive image descriptions compared to the base model.
* **Natural Language Output:** Describes different subjects in the image while specifying their locations using natural language.
* **Optimized for Image Generation:** Produces captions in formats compatible with state-of-the-art text-to-image generation models.
**Note:** This fine-tuned model is optimized for creating text-to-image datasets. As a result, performance on other tasks may be lower than the original model's (for example, roughly a 10% drop on mmmu_val).
## Requirements
If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, try installing the latest version of the transformers library from source:
`pip install git+https://github.com/huggingface/transformers`
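
A quick way to confirm your environment is recent enough is to check the installed version and import the model class directly. This is a minimal sanity check; the version cutoff is approximate (Qwen2-VL support landed around transformers 4.45):

```python
import transformers

print(transformers.__version__)  # should be >= 4.45 (approximate cutoff for Qwen2-VL)

# If this import fails, the installed transformers is too old for this model.
from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
```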
## Quickstart
```python
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"

# Load the model in bfloat16 and spread it across available devices.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A single-turn conversation: one image plus a captioning instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

image = Image.open(r"PATH_TO_YOUR_IMAGE")
# Optionally resize the image here if it does not fit into VRAM, or cap the
# processor's image resolution instead (see the sketch after this code block).
# image = image.resize((1024, 1024))

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output_ids = model.generate(
            **inputs,
            max_new_tokens=384,
            do_sample=True,
            temperature=0.7,
            use_cache=True,
            top_k=50,
        )

# Strip the prompt tokens so only the newly generated caption remains.
generated_ids = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]

output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0]

print(output_text)
```
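
If large images exhaust VRAM, you can resize them yourself (as in the commented line above) or bound the resolution at the processor level. The pixel budgets below follow the upstream Qwen2-VL examples and are illustrative, not tuned for this checkpoint:

```python
from transformers import AutoProcessor

model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"

# Qwen2-VL maps each image to a variable number of visual tokens; min_pixels and
# max_pixels bound that range (roughly 256-1280 tokens here, an example budget).
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_id, min_pixels=min_pixels, max_pixels=max_pixels
)
```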
For more detailed options, refer to the [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) documentation.
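
If the bfloat16 weights still do not fit on your GPU, loading the model quantized is another option. This is a minimal sketch using 4-bit NF4 quantization via bitsandbytes; the exact settings are illustrative assumptions, not recommendations from the model author:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"

# Example 4-bit configuration; adjust to your hardware and quality needs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```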