---
pipeline_tag: image-to-text
tags:
- image-captioning
- visual-question-answering
datasets:
- sbu_captions
- visual_genome
- HuggingFaceM4/VQAv2
- ChristophSchuhmann/MS_COCO_2017_URL_TEXT
widget:
- text: >-
    The living room is cozy, featuring a red leather chair and a white table.
    The chair is in the center, and the table is on the left side. A lamp on
    the left side illuminates the space. A large picture hangs on the wall,
    adding artistic flair. A vase on the table adds a decorative touch. The
    room is well-lit, creating a warm and inviting atmosphere.
  src: >-
    https://github.com/ashvardanian/usearch-images/blob/main/assets/uform-gen-interior.png?raw=true
- text: >-
    A young girl stands in a grassy field, holding an umbrella to shield
    herself from the rain. She dons a yellow dress and seems to relish her
    time outdoors. The umbrella is open, offering protection from the rain.
    The field is bordered by trees, fostering a tranquil and natural ambiance.
  src: >-
    https://github.com/ashvardanian/usearch-images/blob/main/assets/uform-gen-umbrella.png?raw=true
language:
- en
license: apache-2.0
base_model: unum-cloud/uform-vl-english
---
# UForm

*Pocket-Sized Multimodal AI for Content Understanding and Generation*

## Description
UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:
1. the `uform-vl-english` visual encoder;
2. the `Sheared-LLaMA-1.3B` language model, tuned on instruction datasets.

The model was pre-trained on MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA, and a few internal datasets.
## Usage

```bash
pip install uform
```
The generative model can be used to caption images, summarize their content, or answer questions about them. The exact behavior is controlled by prompts.
```python
import torch
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# Example prompts:
# [cap] Narrate the contents of the image with precision.
# [cap] Summarize the visual content of the image.
# [vqa] What is the main subject of the image?
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("zebra.jpg")

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=128,
        eos_token_id=32001,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Keep only the newly generated tokens, dropping the prompt.
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```
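
The same checkpoint handles both captioning and question answering depending on the prompt prefix. As a convenience, the call above can be wrapped in a small helper. This is only a sketch reusing the exact calls from the snippet; the `generate_answer` name and its defaults are illustrative, not part of the `uform` API.

```python
def generate_answer(model, processor, image, prompt, max_new_tokens=128):
    # Prefix the prompt with "[cap]" for captioning or "[vqa]" for question answering.
    inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
    with torch.inference_mode():
        output = model.generate(
            **inputs,
            do_sample=False,
            use_cache=True,
            max_new_tokens=max_new_tokens,
            eos_token_id=32001,
            pad_token_id=processor.tokenizer.pad_token_id,
        )
    # Return only the newly generated text, without the prompt tokens.
    prompt_len = inputs["input_ids"].shape[1]
    return processor.batch_decode(output[:, prompt_len:])[0]

caption = generate_answer(model, processor, image, "[cap] Summarize the visual content of the image.")
answer = generate_answer(model, processor, image, "[vqa] What is the main subject of the image?")
```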
## Evaluation

For captioning evaluation we measure CLIPScore and RefCLIPScore¹.

| Model                               | Size | Caption Length | CLIPScore | RefCLIPScore |
| :---------------------------------- | ---: | :------------- | --------: | -----------: |
| `llava-hf/llava-1.5-7b-hf`          |   7B | Long           |     0.878 |        0.529 |
| `llava-hf/llava-1.5-7b-hf`          |   7B | Short          |     0.886 |        0.531 |
| `Salesforce/instructblip-vicuna-7b` |   7B | Long           |     0.902 |        0.534 |
| `Salesforce/instructblip-vicuna-7b` |   7B | Short          |     0.848 |        0.523 |
| `unum-cloud/uform-gen`              | 1.5B | Long           |     0.847 |        0.523 |
| `unum-cloud/uform-gen`              | 1.5B | Short          |     0.842 |        0.522 |
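
For context, CLIPScore is the reference-free similarity between an image and its generated caption in a CLIP embedding space, while RefCLIPScore additionally folds in the reference captions. Below is a simplified sketch of the reference-free variant; it uses the small `openai/clip-vit-base-patch32` checkpoint from `transformers` as a stand-in for the DFN5B model named in the footnote, and omits the rescaling constant of the original CLIPScore formulation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in CLIP checkpoint; the footnote below uses apple/DFN5B-CLIP-ViT-H-14-378 instead.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    # Embed the image and the caption, then take the clipped cosine similarity.
    inputs = clip_processor(text=[caption], images=[image], return_tensors="pt", padding=True, truncation=True)
    with torch.inference_mode():
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    cosine = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
    return max(cosine, 0.0)
```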
Results for the VQAv2 evaluation.

| Model                      | Size | Accuracy |
| :------------------------- | ---: | -------: |
| `llava-hf/llava-1.5-7b-hf` |   7B |     78.5 |
| `unum-cloud/uform-gen`     | 1.5B |     66.5 |
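
For reference, VQAv2 accuracy credits a prediction according to how many of the ten human annotators gave the same answer, capped at 1. The sketch below shows the commonly used simplified rule, skipping the official answer normalization and the averaging over annotator subsets; `vqa_accuracy` is an illustrative helper, not part of any library.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    # An answer counts as fully correct if at least 3 of the 10 annotators gave it.
    matches = sum(ans.strip().lower() == prediction.strip().lower() for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators agree, so the prediction scores 1.0.
print(vqa_accuracy("zebra", ["zebra"] * 4 + ["horse"] * 6))
```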
¹ We used the `apple/DFN5B-CLIP-ViT-H-14-378` CLIP model.
## Speed

On an RTX 3090, the following text-generation throughput is expected with `float16`, equivalent PyTorch settings, and greedy decoding.
| Model                               | Size | Speed                | Speedup |
| :---------------------------------- | ---: | :------------------- | ------: |
| `llava-hf/llava-1.5-7b-hf`          |   7B | ~ 40 tokens/second   |         |
| `Salesforce/instructblip-vicuna-7b` |   7B | ~ 40 tokens/second   |         |
| `unum-cloud/uform-gen`              | 1.5B | ~ 140 tokens/second  |   x 3.5 |
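
These figures depend on the GPU, precision, and decoding settings. A rough way to reproduce a tokens/second number on your own hardware, reusing `model` and `inputs` from the usage snippet above, is sketched below; the warm-up pass and timing code are illustrative and assume a CUDA device, not part of the card's benchmark setup.

```python
import time
import torch

# Move the model and inputs to the GPU in float16, matching the benchmark settings.
model = model.to("cuda", dtype=torch.float16)
inputs = {
    k: v.to("cuda", dtype=torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}

# Warm-up pass so one-time CUDA initialization is not counted.
with torch.inference_mode():
    model.generate(**inputs, do_sample=False, use_cache=True, max_new_tokens=8)
torch.cuda.synchronize()

start = time.perf_counter()
with torch.inference_mode():
    output = model.generate(**inputs, do_sample=False, use_cache=True, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/second")
```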