Image-Text-to-Text
Safetensors
llava_llama

UGround

UGround is a strong GUI visual grounding model trained with a simple recipe. See our homepage and paper for more details.


Citation Information

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243}
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024}
}

Model size: 7.06B params
Tensor type: FP16