Update README.md
README.md
CHANGED
@@ -38,6 +38,47 @@ UGround is a strong GUI visual grounding model trained with a simple recipe. Che
 - [x] Online Demo (HF Spaces)
 
 
+## Inference
+
+### vLLM server
+
+```bash
+vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
+```
+
+### Visual Grounding Prompt
+```python
+def format_openai_template(description: str, base64_image):
+    return [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
+                },
+                {
+                    "type": "text",
+                    "text": f"""
+Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.
+
+- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
+- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
+- Your answer should be a single string (x, y) corresponding to the point of the interest.
+
+Description: {description}
+
+Answer:"""
+                },
+            ],
+        },
+    ]
+```
+
+
+
+
+
 
 
 ## Citation Information
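The added README section stops at the prompt template, so here is a minimal sketch of how the messages it builds might be sent to the vLLM server and how the `(x, y)` reply might be read back. It rests on several assumptions that the diff does not confirm: the server listens on vLLM's default `http://localhost:8000/v1` OpenAI-compatible endpoint, greedy decoding is appropriate, and the helper names (`encode_image`, `build_request`, `parse_point`, `ground`) are hypothetical. The diff also does not say whether the returned coordinates are pixels or a normalized range, so the sketch returns them unscaled.

```python
import base64
import json
import re
import urllib.request

# Assumptions (not stated in the diff): vLLM's default host and port.
BASE_URL = "http://localhost:8000/v1"
API_KEY = "token-abc123"        # same value as --api-key in the serve command
MODEL = "osunlp/UGround-V1-7B"  # same model name as in the serve command


def encode_image(path: str) -> str:
    """Read an image file and return its contents as base64 text."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def build_request(messages: list, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    body = {
        "model": MODEL,
        "messages": messages,
        "temperature": 0,  # greedy decoding; an assumption, not from the README
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def parse_point(answer: str):
    """Return the first "(x, y)" pair in the reply as floats, or None."""
    number = r"(-?\d+(?:\.\d+)?)"
    match = re.search(rf"\(\s*{number}\s*,\s*{number}\s*\)", answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def ground(messages: list):
    """Send the messages to the server and parse the predicted point."""
    with urllib.request.urlopen(build_request(messages)) as resp:
        reply = json.load(resp)
    return parse_point(reply["choices"][0]["message"]["content"])
```

With the server running, a call might look like `ground(format_openai_template("the search button", encode_image("screenshot.jpg")))`, where the description and file name are placeholders. Keeping `parse_point` separate from the network call makes it easy to check the parsing logic without a running server.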