---
library_name: transformers
pipeline_tag: image-text-to-text
---

Ferret-UI is the first UI-centric multimodal large language model (MLLM) designed for referring, grounding, and reasoning tasks. It comes in two variants, built on Gemma-2B and Llama-3-8B, and is capable of executing complex UI tasks.

This is the **Gemma-2B** version of Ferret-UI. It is based on [this paper](https://arxiv.org/pdf/2404.05719) by Apple.

## How to Use 🤗📱

First, download `builder.py`, `conversation.py`, `inference.py`, and `model_UI.py` to your local working directory.

```bash
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
```
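
If `wget` is not available, the same four files can be fetched from Python. The following is a minimal sketch using only the standard library; the names `raw_url` and `download_helpers` are illustrative helpers and are not part of the repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

REPO_ID = "jadechoghari/Ferret-UI-Gemma2b"
HELPER_FILES = ["conversation.py", "builder.py", "inference.py", "model_UI.py"]


def raw_url(filename, repo_id=REPO_ID):
    """Build the same raw-file URL that the wget commands use."""
    return f"https://huggingface.co/{repo_id}/raw/main/{filename}"


def download_helpers(target_dir="."):
    """Download each helper script into target_dir and return the saved paths."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in HELPER_FILES:
        destination = out_dir / name
        urlretrieve(raw_url(name), str(destination))  # requires network access
        saved.append(destination)
    return saved
```

Calling `download_helpers()` from the directory that holds your script places the files next to it, so that `from inference import inference_and_run` can be resolved.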

### Usage:

```python
from inference import inference_and_run

image_path = "appstore_reminders.png"
prompt = "Describe the image in detail"

# Call the function without a box
processed_image, inference_text = inference_and_run(image_path, prompt, conv_mode="ferret_gemma_instruct", model_path="jadechoghari/Ferret-UI-Gemma2b")

# Output processed text
print("Inference Text:", inference_text)
```

```python
from inference import inference_and_run

# Task with a bounding box
image_path = "appstore_reminders.png"
prompt = "What's inside the selected region?"
box = [189, 906, 404, 970]

processed_image, inference_text = inference_and_run(
    image_path=image_path,
    prompt=prompt,
    conv_mode="ferret_gemma_instruct",
    model_path="jadechoghari/Ferret-UI-Gemma2b",
    box=box
)

# Output the inference text; the processed image can optionally be saved
print("Inference Text:", inference_text)
```
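
The `box` argument is a list of four numbers, so a small check before calling the model can catch malformed regions early. This sketch assumes the box is `[x1, y1, x2, y2]` in pixel coordinates of the input image with the origin at the top-left; that convention is inferred from the coordinate format named in the grounding prompts of this card, not confirmed by it. `validate_box` is a hypothetical helper and is not part of `inference.py`.

```python
def validate_box(box, image_width, image_height):
    """Return the box as a list of ints, or raise ValueError if it is malformed.

    Assumption: box is [x1, y1, x2, y2] in pixels, origin at the top-left.
    """
    if len(box) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(box)}")
    x1, y1, x2, y2 = (int(v) for v in box)
    if not (0 <= x1 < x2 <= image_width):
        raise ValueError(f"x range [{x1}, {x2}] does not fit width {image_width}")
    if not (0 <= y1 < y2 <= image_height):
        raise ValueError(f"y range [{y1}, {y2}] does not fit height {image_height}")
    return [x1, y1, x2, y2]
```

The image dimensions could be read with Pillow, for example `width, height = Image.open(image_path).size`, before passing the checked box to `inference_and_run`.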

```python
# GROUNDING PROMPTS
GROUNDING_TEMPLATES = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nInclude the coordinates for each mentioned object.',
    '\nLocate the objects with their coordinates.',
    '\nAnswer in [x1, y1, x2, y2] format.',
    '\nMention the objects and their locations using the format [x1, y1, x2, y2].',
    '\nDraw boxes around the mentioned objects.',
    '\nUse boxes to show where each thing is.',
    '\nTell me where the objects are with coordinates.',
    '\nList where each object is with boxes.',
    '\nShow me the regions with boxes.'
]
```
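
One plausible way to use these templates is to append one to the prompt so that the model is asked to include coordinates in its answer. The sketch below assumes that appending is the intended usage and that boxes appear in the reply in the literal `[x1, y1, x2, y2]` form; neither is confirmed by this card. `add_grounding` and `extract_boxes` are hypothetical helpers, and only two of the templates are repeated to keep the example self-contained.

```python
import re

# Two of the grounding templates from this card, repeated for self-containment.
GROUNDING_TEMPLATES = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nAnswer in [x1, y1, x2, y2] format.',
]

# Matches four comma-separated integers in square brackets, e.g. [189, 906, 404, 970].
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def add_grounding(prompt, template_index=0):
    """Append one grounding template to a plain prompt."""
    return prompt + GROUNDING_TEMPLATES[template_index]


def extract_boxes(text):
    """Return every [x1, y1, x2, y2] group found in text as a list of int lists."""
    return [[int(v) for v in match.groups()] for match in BOX_PATTERN.finditer(text)]
```

The string returned by `add_grounding` would be passed as `prompt` to `inference_and_run`, and `extract_boxes(inference_text)` could then be applied to the reply. If the model formats coordinates differently, the pattern would need to be adjusted.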