osunlp/UGround-V1-7B

Image-Text-to-Text


BoyuNLP committed on Jan 3

Commit 4946fb3 · verified · 1 Parent(s): 14c14d5

Update README.md

Files changed (1)

README.md +41 -0

@@ -38,6 +38,47 @@ UGround is a strong GUI visual grounding model trained with a simple recipe. Che
 - [x] Online Demo (HF Spaces)
+## Inference
+### vLLM server
+```bash
+vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
+```
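Once started, `vllm serve` exposes an OpenAI-compatible HTTP API. The sketch below shows one way a client might call it using only the Python standard library. It assumes the server runs locally on vLLM's default port 8000. The helper names `build_chat_request` and `send_chat_request` and the `max_tokens` and `temperature` values are hypothetical, chosen for illustration, and are not part of the README.

```python
import json
import urllib.request

# Assumptions: the server runs locally on vLLM's default port 8000, and the
# sampling settings below are illustrative, not values given in the README.
BASE_URL = "http://localhost:8000/v1"
API_KEY = "token-abc123"  # must match the --api-key passed to `vllm serve`
MODEL = "osunlp/UGround-V1-7B"


def build_chat_request(messages, max_tokens=128, temperature=0.0):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": MODEL,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made. The `messages` argument is expected to be a list in the OpenAI chat format.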
+### Visual Grounding Prompt
+```python
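+def format_openai_template(description: str, base64_image: str):
+    """Build OpenAI-style chat messages pairing a base64 JPEG screenshot with the grounding prompt."""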
+    return [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
+                },
+                {
+                    "type": "text",
+                    "text": f"""
+  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.
+  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
+  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
+  - Your answer should be a single string (x, y) corresponding to the point of the interest.
+  Description: {description}
+  Answer:"""
+                },
+            ],
+        },
+    ]
+```
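The prompt asks the model to answer with a single `(x, y)` string, so the reply has to be parsed before it can be used as a click target. The following is a minimal sketch of that step, not the authors' code. The helpers `parse_point` and `to_pixels` are hypothetical names. The default `scale=1000.0` assumes the model reports coordinates on a 0 to 1000 grid relative to the image size. That scale is not stated in this diff, so check it against the model card before relying on it.

```python
import re

# Matches the first "(x, y)" pair in a reply; integers or decimals are accepted.
_POINT_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_point(answer: str):
    """Extract the first (x, y) pair from the model's reply as floats."""
    match = _POINT_RE.search(answer)
    if match is None:
        raise ValueError(f"no (x, y) point found in: {answer!r}")
    return float(match.group(1)), float(match.group(2))


def to_pixels(point, width, height, scale=1000.0):
    """Map a point on a 0..scale grid to pixel coordinates.

    ASSUMPTION: scale=1000.0 is a guess at the model's output range and is
    not confirmed by the README diff.
    """
    x, y = point
    return round(x / scale * width), round(y / scale * height)
```

A regular expression is used here instead of `eval` so that unexpected text in the reply raises a clear `ValueError` and is never executed as code.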
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/u5bXFxxAWCXthyXWyZkM4.png)
 ## Citation Information