+---
+license: apache-2.0
+language:
+- en
+---
+<h1 align="center">UForm</h1>
+<h3 align="center">
+Pocket-Sized Multimodal AI<br/>
+For Content Understanding and Generation<br/>
+</h3>
+## Description
+UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:
+1. [UForm Vision Encoder](https://huggingface.co/unum-cloud/uform-vl-english)
+2. [Sheared-LLaMA-1.3B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-1.3B) manually tuned on the instruction dataset
+The model was pre-trained on: MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA and a few internal datasets.
+### Usage
+```bash
+pip install uform
+```
+The generative model can be used to caption images, summarize their content, or answer questions about them.
+The exact behavior is controlled by prompts.
+```python
+import torch
+from PIL import Image
+from uform.gen_model import VLMForCausalLM, VLMProcessor
+# Load the model weights and the matching text/image processor.
+model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
+processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")
+# Example prompts ([cap] for captioning, [vqa] for visual question answering):
+# [cap] Narrate the contents of the image with precision.
+# [cap] Summarize the visual content of the image.
+# [vqa] What is the main subject of the image?
+prompt = "[cap] Summarize the visual content of the image."
+image = Image.open("zebra.jpg")
+inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
+# Greedy decoding, with gradient tracking disabled.
+with torch.inference_mode():
+    output = model.generate(
+        **inputs,
+        do_sample=False,
+        use_cache=True,
+        max_new_tokens=128,
+        eos_token_id=32001,
+        pad_token_id=processor.tokenizer.pad_token_id
+    )
+# Skip the prompt tokens so that only the newly generated text is decoded.
+prompt_len = inputs["input_ids"].shape[1]
+decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
+```
+## Evaluation
+For captioning evaluation we measure CLIPScore and RefCLIPScore¹.
+| Model                               | Size | Caption Length | CLIPScore | RefCLIPScore |
+| :---------------------------------- | ---: | -------------: | --------: | -----------: |
+| `llava-hf/llava-1.5-7b-hf`          |   7B |           Long |     0.878 |        0.529 |
+| `llava-hf/llava-1.5-7b-hf`          |   7B |          Short |     0.886 |        0.531 |
+|                                     |
+| `Salesforce/instructblip-vicuna-7b` |   7B |           Long |     0.902 |        0.534 |
+| `Salesforce/instructblip-vicuna-7b` |   7B |          Short |     0.848 |        0.523 |
+|                                     |
+| `unum-cloud/uform-gen`              | 1.5B |           Long |     0.847 |        0.523 |
+| `unum-cloud/uform-gen`              | 1.5B |          Short |     0.842 |        0.522 |
+|                                     |
+| `unum-cloud/uform-gen-chat`         | 1.5B |           Long |     0.860 |        0.525 |
+| `unum-cloud/uform-gen-chat`         | 1.5B |          Short |     0.858 |        0.525 |
+VQAv2 evaluation results:
+| Model                      | Size | Accuracy (%) |
+| :------------------------- | ---: | -----------: |
+| `llava-hf/llava-1.5-7b-hf` |   7B |         78.5 |
+| `unum-cloud/uform-gen`     | 1.5B |         66.5 |
+¹ We used the `apple/DFN5B-CLIP-ViT-H-14-378` CLIP model.
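For readers unfamiliar with the captioning metrics, here is a minimal sketch of how CLIPScore and RefCLIPScore are commonly defined in the original CLIPScore paper: a rescaled, clipped cosine similarity between caption and image embeddings, and its harmonic mean with the best caption-to-reference similarity. The function names, the plain-list embeddings, and the weight `w = 2.5` are assumptions for illustration. The card does not state which scaling or implementation was used for the tables above, so this sketch is not expected to reproduce those numbers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(caption_emb, image_emb, w=2.5):
    """CLIPScore: w * max(cos(caption, image), 0). The weight w is an assumption."""
    return w * max(cosine(caption_emb, image_emb), 0.0)


def ref_clip_score(caption_emb, image_emb, reference_embs, w=2.5):
    """RefCLIPScore: harmonic mean of CLIPScore and the best reference similarity."""
    s = clip_score(caption_emb, image_emb, w)
    r = max(max(cosine(caption_emb, ref) for ref in reference_embs), 0.0)
    if s == 0.0 or r == 0.0:
        return 0.0
    return 2.0 * s * r / (s + r)


# Toy 2-D embeddings (hypothetical values, not real CLIP outputs).
caption = [1.0, 0.0]
image = [1.0, 0.0]
references = [[1.0, 0.0], [0.0, 1.0]]
print(clip_score(caption, image))
print(ref_clip_score(caption, image, references))
```

In practice the embeddings would come from the text and image towers of a CLIP model such as the one named in the footnote.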
+## Speed
+On an RTX 3090, the following text token generation speed is expected when using `float16`, equivalent PyTorch settings, and greedy decoding.
+| Model                               | Size |               Speed |   Speedup |
+| :---------------------------------- | ---: | ------------------: | --------: |
+| `llava-hf/llava-1.5-7b-hf`          |   7B |  ~ 40 tokens/second |           |
+| `Salesforce/instructblip-vicuna-7b` |   7B |  ~ 40 tokens/second |           |
+| `unum-cloud/uform-gen`              | 1.5B | ~ 140 tokens/second | __x 3.5__ |
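The speedup column follows directly from the throughput figures. As a rough illustration, the sketch below recomputes it and estimates how long generating a 128-token answer (the `max_new_tokens` value from the usage example) would take at each rate. The helper names are hypothetical, and the estimate deliberately ignores image encoding and prompt processing, which the table does not measure.

```python
def speedup(baseline_tps, model_tps):
    """Ratio of the model's throughput to the baseline's throughput."""
    return model_tps / baseline_tps


def generation_seconds(num_tokens, tokens_per_second):
    """Approximate time to generate num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


baseline_tps = 40.0   # ~7B models from the table
uform_tps = 140.0     # unum-cloud/uform-gen from the table

print(speedup(baseline_tps, uform_tps))                # 3.5
print(round(generation_seconds(128, uform_tps), 2))    # 0.91
print(round(generation_seconds(128, baseline_tps), 2)) # 3.2
```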