BoyuNLP committed on
Commit 4ab60fa · verified · 1 Parent(s): ee94829

Update README.md

Files changed (1)
  1. README.md +58 -51
README.md CHANGED
@@ -23,72 +23,79 @@ UGround is a strong GUI visual grounding model trained with a simple recipe. Che

  ## Models

- - Initial UGround-V1: https://huggingface.co/osunlp/UGround
- - UGround-V1-2B (Qwen2-VL): https://huggingface.co/osunlp/UGround-V1-2B
- - UGround-V1-7B (Qwen2-VL): https://huggingface.co/osunlp/UGround-V1-7B
- - UGround-V1-72B (Qwen2-VL): Coming Soon
- - UGround-V1.1-2B (Qwen2-VL): Coming Soon
- - UGround-V1.1-7B (Qwen2-VL): Coming Soon
- - UGround-V1.1-72B (Qwen2-VL): Coming Soon

  ## Release Plan

- - [x] Model Weights
- - [x] Initial V1 (the one used in the paper)
- - [x] Qwen2-VL-based V1
  - [x] 2B
  - [x] 7B
- - [ ] 72B
- - [ ] V1.1
- - [ ] Code
- - [x] Inference Code of UGround
- - [x] Offline Experiments
- - [x] Screenspot (along with referring expressions generated by GPT-4/4o)
  - [x] Multimodal-Mind2Web
  - [x] OmniAct
- - [ ] Android Control
- - [ ] Online Experiments
- - [ ] Mind2Web-Live-SeeAct-V
- - [ ] AndroidWorld-SeeAct-V
- - [ ] Data-V1
- - [ ] Data Examples
- - [ ] Data Construction Scripts
- - [ ] Guidance of Open-source Data
- - [ ] Data-V1.1
  - [x] Online Demo (HF Spaces)


  ## Main Results

  ### GUI Visual Grounding: ScreenSpot (Standard Setting)


- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/SbqTAEZOWMM7vCzAD9JPo.png)
-
- | ScreenSpot (Standard) | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
- | ------------------------ | ---------------- | ------------------ | ----------- | ----------- | ------------ | ------------ | ---------- | ---------- | ---------- |
- | Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
- | Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
- | MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
- | GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
- | GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
- | Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
- | Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
- | Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
- | SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
- | OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
- | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
- | Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
- | ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
- | ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
- | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
- | Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
- | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
- | Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
- | Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
- | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | **79.9** | **93.8** | **76.4** | **90.9** | **84.0** | **86.3** |
- | *AGUVIS-72B* | *Qwen2-VL* | *Aguvis-Stage-1&2* | ***94.5*** | ***85.2*** | *95.4* | *77.9* | *91.3* | ***85.9*** | *88.4* |
- | *UGround-V1-72B-Preview* | *Qwen2-VL* | *UGround-V1* | ***94.5*** | *82.1* | ***95.9*** | ***82.9*** | ***93.0*** | ***85.9*** | ***89.2*** |

  ### GUI Visual Grounding: ScreenSpot (Agent Setting)


  ## Models

+ - Model-V1:
+ - [Initial UGround](https://huggingface.co/osunlp/UGround):
+ - [UGround-V1-2B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-2B)
+ - [UGround-V1-7B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-7B)
+ - [UGround-V1-72B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-72B)
+ - [Training Data](https://huggingface.co/osunlp/UGround)

  ## Release Plan

+ - [x] [Model Weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
+ - [x] Initial Version (the one used in the paper)
+ - [x] Qwen2-VL-Based V1
  - [x] 2B
  - [x] 7B
+ - [x] 72B
+ - [x] Code
+ - [x] Inference Code of UGround (Initial & Qwen2-VL-Based)
+ - [x] Offline Experiments (Code, Results, and Useful Resources)
+ - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
  - [x] Multimodal-Mind2Web
  - [x] OmniAct
+ - [x] Android Control
+ - [x] Online Experiments
+ - [x] Mind2Web-Live-SeeAct-V
+ - [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)
+ - [ ] Data Synthesis Pipeline (Coming Soon)
+ - [x] Training-Data (V1)
  - [x] Online Demo (HF Spaces)


+
+
  ## Main Results

+
  ### GUI Visual Grounding: ScreenSpot (Standard Setting)


+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/hVwF_cOjLiUF0W0VUyxtp.png)
+
+ | ScreenSpot (Standard) | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
+ | ------------------------------- | ---------------- | ------------------ | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
+ | InternVL-2-4B | InternVL-2 | | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
+ | Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
+ | Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
+ | MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
+ | GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
+ | GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
+ | Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
+ | Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
+ | Ferret-UI-Llama8b | Ferret-UI | | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
+ | Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
+ | CogAgent | CogAgent | | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
+ | SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
+ | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
+ | OmniParser | | | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
+ | **UGround** | LLaVA-UGround-V1 | UGround-V1 | 82.8 | **60.3** | 82.5 | **63.6** | 80.4 | **70.4** | **73.3** |
+ | Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
+ | ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
+ | ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
+ | Molmo-7B-D | | | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
+ | **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
+ | Molmo-72B | | | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
+ | Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
+ | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
+ | Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
+ | Claude (Computer-Use) | | | **98.2** | **85.6** | 79.9 | 57.1 | **92.2** | **84.5** | 82.9 |
+ | Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | **93.8** | 67.1 | 88.3 | 75.2 | 83.0 |
+ | Project Mariner | | | | | | | | | 84.0 |
+ | **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | **93.8** | **76.4** | 90.9 | 84.0 | **86.3** |
+ | *AGUVIS-72B* | *Qwen2-VL* | *Aguvis-Stage-1&2* | *94.5* | *85.2* | *95.4* | *77.9* | *91.3* | *85.9* | *88.4* |
+ | ***UGround-V1-72B (Qwen2-VL)*** | *Qwen2-VL* | *UGround-V1* | *94.1* | *83.4* | *94.9* | *85.7* | *90.4* | *87.9* | *89.4* |
+
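
The Avg column in the table above appears to be the unweighted mean of the six split scores (Mobile, Desktop, and Web, each with Text and Icon). The README does not state this; it is inferred from the listed values. A minimal Python sketch that recomputes Avg for three rows copied from the table, where `ROWS` and `macro_avg` are illustrative names and not part of the UGround code:

```python
# Recompute the "Avg" column as an unweighted mean of the six split scores.
# Assumption (inferred from the table, not stated in the README):
# Avg = mean(Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon,
#            Web-Text, Web-Icon), rounded to one decimal place.

# Split scores copied from the updated ScreenSpot (Standard) table.
ROWS = {
    "UGround-V1-2B (Qwen2-VL)": [89.4, 72.0, 88.7, 65.7, 81.3, 68.9],
    "UGround-V1-7B (Qwen2-VL)": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
    "UGround-V1-72B (Qwen2-VL)": [94.1, 83.4, 94.9, 85.7, 90.4, 87.9],
}


def macro_avg(scores):
    """Return the unweighted mean of the per-split accuracies, to one decimal."""
    return round(sum(scores) / len(scores), 1)


if __name__ == "__main__":
    for name, scores in ROWS.items():
        print(f"{name}: {macro_avg(scores)}")
```

Because the mean is unweighted, each split counts equally regardless of how many examples it contains, so a sample-weighted accuracy over the whole benchmark could differ from the Avg shown here.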

  ### GUI Visual Grounding: ScreenSpot (Agent Setting)