Documentation

#1
by iiBLACKii - opened

How can I pass an image to an LLM for analysis and obtain output in the form of text, bounding boxes, or coordinates? My current use case involves detecting specific UI elements, such as cards on a landing page, but the model isn't successfully identifying these elements. How can I improve the detection process to better recognize such elements?

AgentSea org

Are you using the "detect" keyword? Keep the prompt very, very simple: detect X.

Examples:

detect 'calendar'
detect 'submit' button
detect 'file' drop down menu

Ensure the image resolution is not too low. Consider saving it in a lossless format (e.g. PNG).

Consider using the 896 model, as it performs better and handles finer detail.
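
Here is a minimal sketch of prompting the model with "detect X" via the Hugging Face transformers library, assuming a PaliGemma-style checkpoint. The model id, image filename, and prompt below are placeholders; swap in the checkpoint and file you are actually using (e.g. the 896-resolution variant).

```python
from PIL import Image
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "agentsea/paligemma-3b-ft-waveui-896"  # assumed checkpoint name, adjust to yours
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("landing_page.png").convert("RGB")  # keep the source image lossless
prompt = "detect 'start your project' button"          # keep the prompt very simple

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

# The generated text contains the detection, typically location tokens /
# normalized coordinates followed by the label.
print(processor.decode(output[0], skip_special_tokens=True))
```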

There is nothing specific; if I pass an image, the model should identify the UI elements present in it, such as buttons, links, or any other UI element. Coordinates would also be fine if possible. (Sorry if I am missing anything, I am new to LLMs.)
detected_ui_layout_with_coordinates.png

AgentSea org

Hey @iiBLACKii - as Dan said, you only need to prompt the model with something like 'detect X'.

So, for instance, for the image you provided 'detect start your project button' should give you the normalized coordinates for that button.
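
Since the coordinates come back normalized, here is a small sketch of converting them to pixel space, assuming the model returns values in the 0-1 range as (x_min, y_min, x_max, y_max). The box values and filename are illustrative, not real model output; if your checkpoint emits location tokens in a 0-1023 range instead, divide by 1024 first.

```python
from PIL import Image

image = Image.open("landing_page.png")
width, height = image.size

norm_box = (0.12, 0.55, 0.34, 0.62)  # example normalized box, not real output
x_min, y_min, x_max, y_max = norm_box
pixel_box = (
    int(x_min * width),
    int(y_min * height),
    int(x_max * width),
    int(y_max * height),
)
print(pixel_box)  # pixel-space box you can draw or click on
```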

Is it possible to detect the UI elements themselves without using 'detect X'?
