	---
	license: apache-2.0
	language:
	- en
	---
	<h1 align="center">UForm</h1>
	<h3 align="center">
	Pocket-Sized Multimodal AI<br/>
	For Content Understanding and Generation<br/>
	</h3>

	## Description

	UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

	1. [UForm Vision Encoder](https://huggingface.co/unum-cloud/uform-vl-english)
	2. [Sheared-LLaMA-1.3B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-1.3B), manually tuned on an instruction dataset

	The model was pre-trained on: MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA and a few internal datasets.

	### Usage

	```bash
	pip install uform
	```

	The generative model can be used to caption images, summarize their content, or answer questions about them.
	The exact behavior is controlled by prompts.

	```python
	import torch
	from PIL import Image
	from uform.gen_model import VLMForCausalLM, VLMProcessor

	model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
	processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

	# Example prompts (note the [cap] / [vqa] prefixes):
	# [cap] Narrate the contents of the image with precision.
	# [cap] Summarize the visual content of the image.
	# [vqa] What is the main subject of the image?
	prompt = "[cap] Summarize the visual content of the image."
	image = Image.open("zebra.jpg")

	inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
	with torch.inference_mode():
	    output = model.generate(
	        **inputs,
	        do_sample=False,  # greedy decoding
	        use_cache=True,
	        max_new_tokens=128,
	        eos_token_id=32001,
	        pad_token_id=processor.tokenizer.pad_token_id,
	    )

	# Drop the prompt tokens and decode only the generated continuation.
	prompt_len = inputs["input_ids"].shape[1]
	decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
	```
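Since the prompt controls the behavior, it can help to build prompts programmatically. The sketch below is a minimal illustration: `TASK_PREFIXES` and `build_prompt` are hypothetical names that are not part of the `uform` package, and the mapping assumes that the `[cap]` and `[vqa]` tags from the example prompts above mark captioning and question answering respectively.

```python
# Hypothetical helper, not part of the uform package.
# Assumption: "[cap]" marks captioning prompts and "[vqa]" marks questions,
# as suggested by the example prompts in the usage snippet.
TASK_PREFIXES = {"caption": "[cap]", "vqa": "[vqa]"}


def build_prompt(task: str, text: str) -> str:
    """Prefix `text` with the tag for `task`."""
    if task not in TASK_PREFIXES:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}"
        )
    return f"{TASK_PREFIXES[task]} {text.strip()}"


prompts = [
    build_prompt("caption", "Summarize the visual content of the image."),
    build_prompt("vqa", "What is the main subject of the image?"),
]
```

Each resulting string can be passed as the `prompt` in the usage snippet above.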


	## Evaluation

	For captioning evaluation, we measure CLIPScore and RefCLIPScore¹.

	| Model                               | Size | Caption Length | CLIPScore | RefCLIPScore |
	| :---------------------------------- | ---: | -------------: | --------: | -----------: |
	| `llava-hf/llava-1.5-7b-hf`          |   7B |           Long |     0.878 |        0.529 |
	| `llava-hf/llava-1.5-7b-hf`          |   7B |          Short |     0.886 |        0.531 |
	|                                     |      |                |           |              |
	| `Salesforce/instructblip-vicuna-7b` |   7B |           Long |     0.902 |        0.534 |
	| `Salesforce/instructblip-vicuna-7b` |   7B |          Short |     0.848 |        0.523 |
	|                                     |      |                |           |              |
	| `unum-cloud/uform-gen`              | 1.5B |           Long |     0.847 |        0.523 |
	| `unum-cloud/uform-gen`              | 1.5B |          Short |     0.842 |        0.522 |
	|                                     |      |                |           |              |
	| `unum-cloud/uform-gen-chat`         | 1.5B |           Long |     0.860 |        0.525 |
	| `unum-cloud/uform-gen-chat`         | 1.5B |          Short |     0.858 |        0.525 |
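For readers unfamiliar with the two caption metrics, here is a minimal sketch of how they are commonly defined (Hessel et al., 2021): CLIPScore is a rescaled, clipped cosine similarity between the image and caption embeddings, and RefCLIPScore is the harmonic mean of CLIPScore and the best caption-to-reference similarity. This is an assumption about the general definitions, not a description of the evaluation code behind the table. The function names are illustrative, the embeddings are plain Python lists standing in for CLIP outputs, and the rescaling factor `w` (2.5 in the original definition) is exposed because the scaling used for the numbers above is not stated.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore: w * max(cos(image, caption), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Harmonic mean of CLIPScore and the best caption-to-reference similarity."""
    cs = clip_score(image_emb, caption_emb, w)
    best_ref = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if cs == 0.0 or best_ref == 0.0:
        return 0.0  # harmonic mean is zero if either term is zero
    return 2 * cs * best_ref / (cs + best_ref)
```

In practice the embeddings would come from a CLIP image encoder and text encoder, with scores averaged over the evaluation set.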

	Results on the VQAv2 evaluation:

	| Model                      | Size | Accuracy |
	| :------------------------- | ---: | -------: |
	| `llava-hf/llava-1.5-7b-hf` |   7B |     78.5 |
	| `unum-cloud/uform-gen`     | 1.5B |     66.5 |
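As background on the accuracy column, the sketch below shows the standard VQA accuracy rule, assuming the table follows it: a predicted answer earns `min(1, matches / 3)` against the human answers, averaged over every leave-one-annotator-out subset. The function name `vqa_accuracy` is illustrative, and the answer normalization that the official evaluation applies (lowercasing, punctuation and article handling) is deliberately omitted.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Standard VQA accuracy for one question, without answer normalization."""
    if not human_answers:
        raise ValueError("at least one human answer is required")
    scores = []
    for i in range(len(human_answers)):
        # Leave out annotator i and count exact matches among the rest.
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(answer == prediction for answer in others)
        scores.append(min(1.0, matches / 3))
    return sum(scores) / len(scores)


# Toy example with 10 annotators: 2 said "zebra", 8 said "horse".
answers = ["zebra"] * 2 + ["horse"] * 8
zebra_score = vqa_accuracy("zebra", answers)
horse_score = vqa_accuracy("horse", answers)
```

A dataset-level figure like those in the table would be the mean of this per-question score over all questions, expressed as a percentage.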

	¹ We used the `apple/DFN5B-CLIP-ViT-H-14-378` CLIP model.


	## Speed

	On an RTX 3090, the following text-token generation speeds are expected with `float16`, equivalent PyTorch settings, and greedy decoding.

	| Model                               | Size |               Speed |   Speedup |
	| :---------------------------------- | ---: | ------------------: | --------: |
	| `llava-hf/llava-1.5-7b-hf`          |   7B |  ~ 40 tokens/second |           |
	| `Salesforce/instructblip-vicuna-7b` |   7B |  ~ 40 tokens/second |           |
	| `unum-cloud/uform-gen`              | 1.5B | ~ 140 tokens/second | __x 3.5__ |
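
The speedup column follows directly from the approximate throughput figures. The short sketch below reproduces that arithmetic and, as a rough illustration, estimates the time to generate 128 new tokens (the `max_new_tokens` value from the usage example). It assumes constant throughput and ignores image encoding and prompt processing, so the timings are indicative only.

```python
# Approximate throughput from the table above, in tokens per second.
speeds = {
    "llava-hf/llava-1.5-7b-hf": 40.0,
    "Salesforce/instructblip-vicuna-7b": 40.0,
    "unum-cloud/uform-gen": 140.0,
}

baseline = speeds["llava-hf/llava-1.5-7b-hf"]
speedup = speeds["unum-cloud/uform-gen"] / baseline  # 140 / 40 = 3.5

# Rough decoding time for 128 new tokens, assuming constant throughput.
max_new_tokens = 128
seconds = {name: max_new_tokens / tps for name, tps in speeds.items()}

for name, secs in seconds.items():
    print(f"{name}: ~{secs:.1f} s for {max_new_tokens} tokens")
```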