|
--- |
|
datasets: |
|
- liuhaotian/LLaVA-Pretrain |
|
- liuhaotian/LLaVA-Instruct-150K |
|
language: |
|
- en |
|
tags: |
|
- llava |
|
- phi |
|
license: mit |
|
library_name: transformers |
|
widget: |
|
- text: "What animal is it?" |
|
src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg" |
|
- text: "Where is it?" |
|
src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg" |
|
--- |
|
|
|
# Multi-crop LLaVA-3b |
|
|
|
<a target="_blank" href="https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> |
|
</a> |
|
|
|
## Model details |
|
|
|
The core idea behind multi-crop LLaVA (MC-LLaVA) is that instead of N visual token embeddings per image, I generate one token embedding per N parts of the image. |
|
Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better. |
|
|
|
For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which |
|
gives the token embedding of size [N, 2560]. Right now, the tokens do not contain explicit information about their position in the original image. I plan to add it later. |
|
|
|
MC-LLaVA-3b was fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) using vision tower from |
|
[SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384). |
|
|
|
The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more. |
|
|
|
As Dolphin 2.6 Phi, LLaVA-3b uses ChatML prompt format: |
|
|
|
``` |
|
<|im_start|>system |
|
You are Dolphin, a helpful AI assistant.<|im_end|> |
|
<|im_start|>user |
|
{prompt}<|im_end|> |
|
<|im_start|>assistant |
|
``` |
|
|
|
## How to use |
|
|
|
**Install dependencies** |
|
|
|
```bash |
|
!pip install -q open_clip_torch timm einops |
|
``` |
|
|
|
**Download modeling files** |
|
|
|
```python |
|
from huggingface_hub import hf_hub_download |
|
|
|
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True) |
|
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True) |
|
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True) |
|
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True) |
|
hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True) |
|
``` |
|
|
|
**Create a model** |
|
|
|
```python |
|
from modeling_llava import LlavaForConditionalGeneration |
|
import torch |
|
|
|
model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16) |
|
model = model.to("cuda") |
|
``` |
|
|
|
**Create processors** |
|
|
|
```python |
|
from transformers import AutoTokenizer |
|
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b") |
|
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config) |
|
processor = LlavaProcessor(image_processor, tokenizer) |
|
``` |
|
|
|
**Set image and text** |
|
|
|
```python |
|
from PIL import Image |
|
import requests |
|
|
|
image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da" |
|
raw_image = Image.open(requests.get(image_file, stream=True).raw) |
|
|
|
prompt = """<|im_start|>system |
|
A chat between a curious human and an artificial intelligence assistant. |
|
The assistant gives helpful, detailed, and polite answers to the human's questions. |
|
The assistant does not hallucinate and pays very close attention to the details.<|im_end|> |
|
<|im_start|>user |
|
<image> |
|
Describe the image.<|im_end|> |
|
<|im_start|>assistant |
|
""" |
|
``` |
|
|
|
**Process inputs** |
|
|
|
```python |
|
inputs = processor(prompt, raw_image, model, return_tensors='pt') |
|
|
|
inputs['input_ids'] = inputs['input_ids'].to(model.device) |
|
inputs['attention_mask'] = inputs['attention_mask'].to(model.device) |
|
``` |
|
|
|
**Generate the data** |
|
|
|
```python |
|
import torch |
|
|
|
with torch.inference_mode(): |
|
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.4, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id) |
|
``` |
|
|
|
## Benchmarks |
|
|
|
- TextVQA - 38.59% |
|
- GQA - 49.6% |
|
- VQAv2 - 64.24% |
|
- VizWiz - 24.88% |
|
- POPE - 80.59% |
|
- V*-bench - 52.25% (OCR - 46.66%, GPT4V-hard - 41.17%, direct attributes - 43.48%, relative position - 65.79%) |
|
|
|
## Examples |
|
|
|
<a target="_blank" href="https://colab.research.google.com/drive/1sXDvVl5s9fTcE0N2bQGOlXhnNlKEdeun"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> |
|
</a> |
|
|
|
## License |
|
|
|
The model is licensed under MIT license, but since the data used for model training is largely synthetic, you should also follow OpenAI and Google Gemini terms of service. |
|
Which means don't create competitor models for them. |
|
|
|
## Acknowledgments |
|
|
|
Thanks to [ML Collective](https://mlcollective.org/) for providing credits for computing resources. |