Limitations of Handling Large Quantities of Objects
Hi everyone,
I'm working on fine-tuning Florence for an object detection task, but I’ve encountered some limitations when dealing with a large number of objects in an image. My domain involves detecting products on shelves, and I realized that Florence might not be the best fit due to its token limitations.
For instance, if the model's maximum output length is 4096 tokens/characters, and a single detected object is serialized as a class label plus four location tokens (e.g. `Product<loc_x1><loc_y1><loc_x2><loc_y2>`), each detection already takes up roughly 45 characters. That means I can only fit around 90 objects per image (4096 / 45 ≈ 91) before hitting the limit.
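For reference, here's the back-of-the-envelope calculation I'm working from (a rough sketch; the exact per-object cost depends on the tokenizer and the label length):

```python
# Rough budget estimate, assuming each detection is serialized as a
# class label plus four quantized location tokens, e.g.
# "Product<loc_x1><loc_y1><loc_x2><loc_y2>".

MAX_OUTPUT = 4096                                   # model's output limit
PER_OBJECT = len("Product") + 4 * len("<loc_999>")  # ~7 + 4*9 = 43 chars

print(MAX_OUTPUT // PER_OBJECT)  # -> 95, i.e. roughly 90 objects per image
```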
Given these limitations, what strategies could be applied to improve performance in this scenario? Would modifying the output format, or some other adjustment, be a viable solution?
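By "modifying the output format" I mean something like the sketch below: since nearly everything on a shelf is the same class, the label could be emitted once and amortized over all the boxes. This is purely a hypothetical target format for illustration, not Florence's actual API, and `compact_targets` is just an illustrative name:

```python
def compact_targets(label: str, boxes: list[tuple[int, int, int, int]]) -> str:
    """Serialize many boxes of one class as a single label + N*4 loc tokens."""
    locs = "".join(
        f"<loc_{x1}><loc_{y1}><loc_{x2}><loc_{y2}>" for x1, y1, x2, y2 in boxes
    )
    return f"{label}{locs}"

# One "Product" label amortized over all boxes: ~7 + 36*N characters
# instead of ~43*N, a modest saving per box (more if the tokenizer
# splits the label into several tokens).
print(compact_targets("Product", [(10, 20, 110, 220), (130, 20, 230, 220)]))
```

Of course, this would mean changing the fine-tuning targets to match, so I'm not sure whether the model would adapt to it well.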
Thanks!