Documentation

#1
by iiBLACKii - opened

How can I pass an image to an LLM for analysis and obtain output in the form of text, bounding boxes, or coordinates? My current use case involves detecting specific UI elements, such as cards on a landing page, but the model isn't successfully identifying these elements. How can I improve the detection process to better recognize such elements?

AgentSea org

Are you using the "detect" keyword? Keep the prompt very, very simple: detect X.

Examples:

detect 'calendar'
detect 'submit' button
detect 'file' drop down menu

Ensure the image resolution is not too low. Consider saving it in a lossless format (e.g. PNG).

Consider using the 896 model, as it performs better and handles finer detail.
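
Here is a minimal sketch of prompting the model with "detect X" via the Hugging Face transformers library, assuming a PaliGemma-style checkpoint. The model id, image filename, and prompt below are placeholders; swap in the checkpoint and file you are actually using (e.g. the 896-resolution variant).

```python
from PIL import Image
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "agentsea/paligemma-3b-ft-waveui-896"  # assumed checkpoint name, adjust to yours
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("landing_page.png").convert("RGB")  # keep the source image lossless
prompt = "detect 'start your project' button"          # keep the prompt very simple

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

# The generated text contains the detection, typically location tokens /
# normalized coordinates followed by the label.
print(processor.decode(output[0], skip_special_tokens=True))
```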

There is nothing specific; if I pass an image, the model should identify the UI elements present in it, such as buttons, links, or any other UI element. Coordinates would also be fine if possible. (Sorry if I am missing anything, I am new to LLMs.)
detected_ui_layout_with_coordinates.png

AgentSea org

Hey @iiBLACKii - as Dan said, you only need to prompt the model with something like 'detect X'.

So, for instance, for the image you provided 'detect start your project button' should give you the normalized coordinates for that button.
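
Since the coordinates come back normalized, here is a small sketch of converting them to pixel space, assuming the model returns values in the 0-1 range as (x_min, y_min, x_max, y_max). The box values and filename are illustrative, not real model output; if your checkpoint emits location tokens in a 0-1023 range instead, divide by 1024 first.

```python
from PIL import Image

image = Image.open("landing_page.png")
width, height = image.size

norm_box = (0.12, 0.55, 0.34, 0.62)  # example normalized box, not real output
x_min, y_min, x_max, y_max = norm_box
pixel_box = (
    int(x_min * width),
    int(y_min * height),
    int(x_max * width),
    int(y_max * height),
)
print(pixel_box)  # pixel-space box you can draw or click on
```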

Is it possible to detect the UI elements themselves without using 'detect X'?
