Crop boxes not aligned with parts
Hey Johan! Super cool project! Here’s some feedback from when I tried it. :)
One thing I noticed is that the crop box I got when clicking on a part did not always align well with the part itself. It seems like you pre-sample a grid of crop boxes, compute and compare their embeddings, and use the box closest to the user's mouse click. It does the job, but it feels a bit awkward to use, especially if you don't know how it works under the hood. Here are some suggestions for how this could be improved:
- Highlight the resulting crop box on mouse hover. This would make the behavior more intuitive for the user, though I'm not sure whether listening to mouse movements in real time would cause performance problems.
- Sample a denser grid, so the resulting crop box aligns better with mouse clicks. If that causes performance issues, you could run a foreground/background classification first and only keep the crops that contain objects. Or embed all crop boxes but use efficient nearest-neighbor search in the embedding space to cut down the number of comparisons (see the sketch after this list).
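To illustrate that last point, here's a minimal sketch of nearest-neighbor search over crop embeddings using scipy's cKDTree. The names (`crop_embeddings`, `click_embedding`) and the dimensions are placeholders for whatever you actually have in the app:

```python
import numpy as np
from scipy.spatial import cKDTree

# Placeholder data: N pre-sampled crop boxes with D-dimensional embeddings.
crop_embeddings = np.random.rand(10_000, 256).astype(np.float32)
click_embedding = np.random.rand(256).astype(np.float32)

# L2-normalize so Euclidean nearest neighbor matches cosine similarity.
crop_embeddings /= np.linalg.norm(crop_embeddings, axis=1, keepdims=True)
click_embedding /= np.linalg.norm(click_embedding)

# Build the index once, right after embedding the grid; query on each click.
tree = cKDTree(crop_embeddings)
distances, indices = tree.query(click_embedding, k=5)  # 5 closest crops
```

KD-trees lose their edge in very high dimensions, so at larger scales an approximate-nearest-neighbor library would fit better, but the principle is the same: index once, then each click is a cheap lookup instead of a scan over every crop.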
You could also change the approach entirely. Ideally, I suppose, you only want crop boxes that fit neatly around each part. That way you wouldn't need to embed a bunch of background crops, and you could handle parts at multiple scales. Perhaps you could find a way of doing unsupervised object detection and/or segmentation and use the results to create the crops?
Now that I think about it, perhaps you could even go down the road of auto-annotating parts and training a smaller model on those annotations? Heavy-duty foundation models that generalize well to new data are perfect for this, for example Meta's newly released Segment Anything Model (SAM).
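If you want to experiment with that, the segment-anything repo ships an automatic mask generator. A rough sketch of how it could produce crop boxes (the image path is just an example, and the ViT-B checkpoint is one of several sizes available in their README):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# ViT-B is the smallest SAM checkpoint; download it from the repo's README.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects RGB; OpenCV loads BGR, so convert first.
image = cv2.cvtColor(cv2.imread("machine.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per detected region

# Each dict carries a binary mask plus its bounding box in XYWH format,
# which could serve directly as a crop box around a part.
crop_boxes = [m["bbox"] for m in masks]
```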
Yes, I have tried running images through SAM and it works well, so that's definitely a way to go. Let's consider that a long-term goal!
You're right that it's using a pre-sampled grid, and I see why that could cause confusion; you're not the only one who has complained about it.
For the short term, I'm considering letting the user pick any box (grid size = 1px) and running inference on just that 64x64 patch. Being able to run inference in the web app will be useful later, as it would allow users to upload images!
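Roughly what I have in mind for the click-to-patch step (pure numpy, names are provisional; the patch would then go through the same embedding model as the grid crops):

```python
import numpy as np

def patch_at_click(image: np.ndarray, x: int, y: int, size: int = 64) -> np.ndarray:
    """Extract a size x size patch centered on the click, clamped to the image.

    Assumes the image is at least size x size pixels.
    """
    h, w = image.shape[:2]
    half = size // 2
    # Clamp the top-left corner so the patch never leaves the image bounds.
    x0 = min(max(x - half, 0), w - size)
    y0 = min(max(y - half, 0), h - size)
    return image[y0:y0 + size, x0:x0 + size]
```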