---
license: mit
datasets:
  - liuhaotian/LLaVA-Pretrain
language:
  - en
  - zh
library_name: transformers
---

WORK IN PROGRESS

## Model type

TinyLLaVA is a small (1.4B-parameter) model trained with the exact training recipe of LLaVA-1.5. We trained TinyLLaVA using TinyLlama as the LLM backbone and clip-vit-large-patch14-336 as the vision backbone.

## Model Performance

We have evaluated TinyLLaVA on GQA, VizWiz, VQAv2, TextVQA and SQA.

| Model | VQAv2 | GQA | SQA | TextVQA | VizWiz |
|---|---|---|---|---|---|
| TinyLLaVA-v1-1.4B | 73.41 | 57.54 | 59.40 | 46.37 | 49.56 |
| BLIP-2 | 41.00 | 41.00 | 61.00 | 42.50 | 19.60 |
| LLaVA-v1.5-7B | 78.50 | 62.00 | 66.80 | 61.3 | 50 |
| LLaVA-v1.5-13B | 80.00 | 63.30 | 71.60 | 61.3 | 53.6 |
| Qwen-VL-7B | 78.80 | 59.30 | 67.10 | 63.8 | 35.2 |
| Qwen-VL-13B | 78.20 | 57.50 | 68.20 | 61.5 | 38.9 |

More evaluations are ongoing.

## Model use

The weights have been converted to the Hugging Face format.

### How to use the model

First, make sure you have `transformers >= 4.35.3` installed. The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in a single prompt. Make sure also to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the token `<image>` at the location where you want to query an image.
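For example, a prompt that queries two images at once might look like the sketch below. This is an illustration only (the question text is made up), and the images must be supplied in the same order as the `<image>` tokens appear:

```python
# Illustrative sketch only: two <image> tokens, so two images are expected,
# consumed in the order they are passed alongside the prompt.
multi_image_prompt = (
    "USER: <image>\n<image>\n"
    "What is different between these two images?\nASSISTANT:"
)
```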

#### Using pipeline:

Below we use the `"bczhou/tiny-llava-v1-hf"` checkpoint.

```python
from transformers import pipeline
from PIL import Image
import requests

model_id = "bczhou/tiny-llava-v1-hf"
pipe = pipeline("image-to-text", model=model_id)

# Example image from the transformers documentation assets
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The <image> token marks where the image is injected into the prompt
prompt = "USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0])
>>> {'generated_text': 'USER:  \nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: The label 15 represents lava, which is a type of volcanic rock.'}
```
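The pipeline returns the full conversation, prompt included. If you only want the assistant's reply, one simple option (not part of the original example) is to split on the `ASSISTANT:` marker:

```python
# Keep only the text after the final "ASSISTANT:" marker (assumes the
# USER/ASSISTANT template shown above is used verbatim).
answer = outputs[0]["generated_text"].split("ASSISTANT:")[-1].strip()
print(answer)
```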

#### Using pure transformers:

Below is an example script to run generation in float16 precision on a GPU device:

```python
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "bczhou/tiny-llava-v1-hf"

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the model in half precision and move it to GPU 0
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```
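
If the float16 weights do not fit on your GPU, one option is to load the model in 4-bit with bitsandbytes. This is a sketch rather than part of the original card; it assumes the `bitsandbytes` package is installed, and the rest of the script above stays unchanged:

```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

# Sketch: 4-bit quantized loading (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "bczhou/tiny-llava-v1-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
```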