---
license: mit
---

# Conditional ViT - B/16 - Categories

*Introduced in **Weakly-Supervised Conditional Embedding for Referred Visual Search**, Lepage et al. 2023*

[`Paper`](https://arxiv.org/abs/2306.02928) | [`Training Data`](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion) | [`Training Code`](https://github.com/Simon-Lepage/CondViT-LRVSF) | [`Demo`](https://huggingface.co/spaces/Slep/CondViT-LRVSF-Demo)

## General Info

Model fine-tuned from CLIP ViT-B/16 on LRVSF at 224x224 resolution. The conditioning categories are the following:

- Bags
- Feet
- Hands
- Head
- Lower Body
- Neck
- Outwear
- Upper Body
- Waist
- Whole Body

Research use only.

## How to Use

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# Load the model and its processor from the Hub
model = AutoModel.from_pretrained("Slep/CondViT-B16-cat")
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-cat")

# Fetch an example image from the LRVSF dataset
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
cat = "Bags"

# Condition the embedding on the chosen category
inputs = processor(images=[img], categories=[cat])
raw_embedding = model(**inputs)
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)
```
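Once query and gallery images have been embedded and normalized as above, retrieval reduces to a cosine-similarity search, which on unit-norm vectors is just a dot product. The sketch below illustrates this ranking step with random stand-in tensors (the embedding dimension of 512 and the gallery size are placeholders, not values specified by the model card); in practice the vectors would come from `model(**inputs)`.

```python
import torch
import torch.nn.functional as F

# Stand-ins for embeddings produced by the model; in real use these
# would be the normalized outputs of model(**inputs) for a query image
# and a gallery of product images. 512 is an assumed embedding dim.
query_embedding = F.normalize(torch.randn(1, 512), dim=-1)
gallery_embeddings = F.normalize(torch.randn(100, 512), dim=-1)

# Cosine similarity on unit-norm vectors is a plain dot product.
scores = query_embedding @ gallery_embeddings.T  # shape: (1, 100)

# Rank the gallery: indices of the 5 most similar items.
top_scores, top_indices = scores.topk(k=5, dim=-1)
```

The top-ranked indices identify the gallery items most similar to the query under the chosen conditioning category.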