Fine-tuning fashionCLIP for image-image search
Hi all, I've been working on image-to-image search tasks and fashionCLIP has worked really well for me. Now I want to push the performance of my approach further, and I was thinking of fine-tuning the fashionCLIP model for this task. Currently I just generate the embeddings of the images, store them in a vector index, and then compute the cosine similarity between the embedding of my search image and all the embeddings in the index. I'm not really using any zero-shot application or image-text comparison, but all the fine-tuning approaches for CLIP models I've read about use text-image pairs. I don't understand how I should fine-tune the model to improve my application: should I use text-image pairs? Or should I only fine-tune the visual encoder of the model? And if that's the case, does anyone have examples of how I can do it?
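For context, here is a minimal sketch of the retrieval setup described above, assuming the FashionCLIP checkpoint on the Hugging Face Hub ("patrickjohncyh/fashion-clip") and plain numpy in place of a dedicated vector index (FAISS or similar would work the same way on the normalized embeddings). The `catalogue_paths` / `query_path` names are placeholders, not from the original post.

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPImageProcessor

model_name = "patrickjohncyh/fashion-clip"  # assumed FashionCLIP checkpoint
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPImageProcessor.from_pretrained(model_name)

@torch.no_grad()
def embed(paths):
    """Return L2-normalized image embeddings of shape (N, D)."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    return F.normalize(feats, dim=-1).cpu().numpy()

# Placeholder paths: your catalogue images and the query image.
catalogue = embed(catalogue_paths)      # this plays the role of the vector index
query = embed([query_path])             # embedding of the search image
scores = catalogue @ query.T            # cosine similarity (vectors are unit norm)
top_k = np.argsort(-scores[:, 0])[:10]  # indices of the 10 closest catalogue items
```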
If you are only doing image search, why don't you just use an image transformer model without a text encoder?
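A hedged sketch of that suggestion: use a text-free vision backbone (DINOv2 here, but any ViT-style image model would do) purely as an embedding extractor. The checkpoint name "facebook/dinov2-base" is an assumption for illustration, not something named in this thread.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_name = "facebook/dinov2-base"  # assumed text-free vision backbone
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

@torch.no_grad()
def embed(paths):
    """Return L2-normalized image embeddings using the CLS token."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    out = model(**inputs)
    cls = out.last_hidden_state[:, 0]  # CLS token as the global image descriptor
    return F.normalize(cls, dim=-1)
```

These embeddings drop straight into the same vector-index / cosine-similarity search as the CLIP image embeddings; only the encoder changes.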
Did you find another approach? I'm also working on image-to-image search for fashion items.