metadata
license: apache-2.0
language:
- en
UForm
Pocket-Sized Multimodal AI
For Content Understanding and Generation
Description
UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:
- UForm Vision Encoder
- Sheared-LLaMA-1.3B, manually tuned on an instruction dataset
The model was pre-trained on MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA, and a few internal datasets.
Usage
pip install uform
The generative model can be used to caption images, summarize their content, or answer questions about them. The exact behavior is controlled by prompts.
import torch
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# Example prompts:
# [cap] Narrate the contents of the image with precision.
# [cap] Summarize the visual content of the image.
# [vqa] What is the main subject of the image?
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("zebra.jpg")

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,  # greedy decoding
        use_cache=True,
        max_new_tokens=128,
        eos_token_id=32001,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Drop the prompt tokens so that only the newly generated text is decoded.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
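Since the behavior is controlled by the prompt, it can help to build prompts in one place. The sketch below is a hypothetical helper, not part of the `uform` package. It assumes, based only on the example prompts above, that `[cap]` marks captioning-style instructions and `[vqa]` marks questions; the names `TASK_PREFIXES` and `build_prompt` are illustrative.

```python
# Hypothetical helper, not part of the uform package.
# Assumption: "[cap]" and "[vqa]" are the task tags, as seen in the example prompts.
TASK_PREFIXES = {"caption": "[cap]", "vqa": "[vqa]"}


def build_prompt(task: str, text: str) -> str:
    """Return the instruction text prefixed with the tag for the given task."""
    if task not in TASK_PREFIXES:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}"
        )
    text = text.strip()
    if not text:
        raise ValueError("prompt text must not be empty")
    return f"{TASK_PREFIXES[task]} {text}"


prompts = [
    build_prompt("caption", "Summarize the visual content of the image."),
    build_prompt("vqa", "What is the main subject of the image?"),
]
```

Each resulting string can be passed as an element of `texts` in the processor call shown above.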
Evaluation
For captioning evaluation we measure CLIPScore and RefCLIPScore¹.
Model | Size | Caption Length | CLIPScore | RefCLIPScore |
---|---|---|---|---|
llava-hf/llava-1.5-7b-hf | 7B | Long | 0.878 | 0.529 |
llava-hf/llava-1.5-7b-hf | 7B | Short | 0.886 | 0.531 |
Salesforce/instructblip-vicuna-7b | 7B | Long | 0.902 | 0.534 |
Salesforce/instructblip-vicuna-7b | 7B | Short | 0.848 | 0.523 |
unum-cloud/uform-gen | 1.5B | Long | 0.847 | 0.523 |
unum-cloud/uform-gen | 1.5B | Short | 0.842 | 0.522 |
For visual question answering, we report accuracy on VQAv2.
Model | Size | Accuracy |
---|---|---|
llava-hf/llava-1.5-7b-hf | 7B | 78.5 |
unum-cloud/uform-gen | 1.5B | 66.5 |
¹ We used the apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.
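For readers unfamiliar with the captioning metrics, here is a minimal sketch of CLIPScore and RefCLIPScore as they are defined in the original CLIPScore paper: a rescaled, clipped cosine similarity between image and caption embeddings, and its harmonic mean with the best caption-to-reference similarity. It operates on embeddings that are assumed to be already computed by a CLIP model. The weight `w=2.5` is the paper's default; this card does not state which scaling or variant produced the numbers above, so treat the code as an illustration of the metric rather than the exact evaluation script.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore: w * max(cos(image, caption), 0). w=2.5 is an assumed default."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """RefCLIPScore: harmonic mean of CLIPScore and the best reference similarity."""
    s_img = clip_score(image_emb, caption_emb, w)
    s_ref = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if s_img == 0.0 or s_ref == 0.0:
        return 0.0
    return 2.0 * s_img * s_ref / (s_img + s_ref)
```

In practice the embeddings would come from the image and text towers of the CLIP model named in the footnote, and the scores would be averaged over the evaluation set.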
Speed
On an RTX 3090, the following text-generation throughput is expected when using float16, equivalent PyTorch settings, and greedy decoding.
Model | Size | Speed | Speedup |
---|---|---|---|
llava-hf/llava-1.5-7b-hf | 7B | ~ 40 tokens/second | |
Salesforce/instructblip-vicuna-7b | 7B | ~ 40 tokens/second | |
unum-cloud/uform-gen | 1.5B | ~ 140 tokens/second | x 3.5 |
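The speedup column is simply the ratio of throughputs. The short sketch below reproduces that arithmetic from the approximate figures in the table and estimates how long a 128-token generation would take at each rate. It assumes decoding dominates the run time and ignores image encoding and prompt processing, so the timings are rough; the variable and function names are illustrative.

```python
# Approximate decoding throughput from the table above, in tokens per second.
throughput = {
    "llava-hf/llava-1.5-7b-hf": 40.0,
    "Salesforce/instructblip-vicuna-7b": 40.0,
    "unum-cloud/uform-gen": 140.0,
}

# The 7B models share the same rate, so either one can serve as the baseline.
baseline = throughput["llava-hf/llava-1.5-7b-hf"]


def speedup(model: str) -> float:
    """Throughput of the model relative to the 7B baseline."""
    return throughput[model] / baseline


def seconds_for(model: str, new_tokens: int = 128) -> float:
    """Rough decoding time for new_tokens tokens, ignoring prompt and image cost."""
    return new_tokens / throughput[model]


for name in throughput:
    print(f"{name}: x{speedup(name):.1f}, ~{seconds_for(name):.2f} s per 128 tokens")
```

At these rates, a 128-token caption takes roughly 3.2 seconds on the 7B models and under a second on UForm-Gen.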