|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-1.5B-Instruct |
|
- google/siglip-so400m-patch14-384 |
|
datasets: |
|
- weizhiwang/Open-Qwen2VL-Data |
|
- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M |
|
language: |
|
- en |
|
license: cc |
|
pipeline_tag: image-text-to-text |
|
library_name: transformers |
|
--- |
|
|
|
# Model Card for Open-Qwen2VL |
|
|
|
Open-Qwen2VL is a multimodal model that takes images and text as input and produces text as output. This model is described in the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595). The code is available at [https://github.com/Victorwz/Open-Qwen2VL](https://github.com/Victorwz/Open-Qwen2VL). |
|
|
|
|
|
|
|
|
|
## Model Details |
|
Open-Qwen2VL pairs the [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder with the [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) language model. It is pre-trained on [weizhiwang/Open-Qwen2VL-Data](https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data) and instruction-tuned on [MAmmoTH-VL/MAmmoTH-VL-Instruct-12M](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M).
|
|
|
## How to Use |
|
|
|
First, install the Open-Qwen2VL package via:
|
```bash
pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
```
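
A quick import check confirms the installation succeeded. This is a minimal sketch; it only verifies that the `prismatic` package provided by the repository above is importable:

```python
# Sanity check: the package installed above exposes the `load` entry point used below
from prismatic import load

print(load)
```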
|
|
|
You can load the model and perform inference as follows: |
|
```python
import requests
import torch
from PIL import Image
from prismatic import load

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Load a pretrained VLM (either a local path, or an ID to auto-download from the HF Hub)
vlm = load("Open-Qwen2VL")
vlm.to(device, dtype=torch.bfloat16)

# Download an image and apply the vision backbone's preprocessing transform
image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
image = [vlm.vision_backbone.image_transform(raw_image).unsqueeze(0)]

# Build the prompt; the <image> tag marks where the image is inserted
user_prompt = '<image>' + '\n' + "Describe the image."

# Generate!
generated_text = vlm.generate_batch(
    image,
    [user_prompt],
    do_sample=False,
    max_new_tokens=512,
    min_length=1,
)
print(generated_text[0])
```
|
The generated caption should look like:
|
```
The image depicts a blue and orange bus parked on the side of a street. ...
```
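
The `generate_batch` call appears to accept parallel lists of images and prompts, as the single-example call above suggests. Below is a minimal sketch under that assumption, continuing from the session above (it reuses `vlm`, `requests`, and `Image`); the second image URL and question are placeholders for illustration:

```python
# Sketch: batched inference, assuming generate_batch pairs images and prompts positionally
image_urls = [
    "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png",
    "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png",  # placeholder: swap in your own image
]
prompts = [
    '<image>' + '\n' + "Describe the image.",
    '<image>' + '\n' + "What color is the bus?",
]

# Preprocess each image with the same vision-backbone transform as above
images = []
for url in image_urls:
    raw = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    images.append(vlm.vision_backbone.image_transform(raw).unsqueeze(0))

outputs = vlm.generate_batch(
    images,
    prompts,
    do_sample=False,
    max_new_tokens=512,
    min_length=1,
)
for text in outputs:
    print(text)
```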
|
|
|
|
|
|
## Citation |
|
If you find Open-Qwen2VL useful, please cite the paper [Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources](https://huggingface.co/papers/2504.00595).