---
license: mit
---

# Conditional ViT - B/16 - Categories

*Introduced in **Weakly-Supervised Conditional Embedding for Referred Visual Search**, Lepage et al. 2023*

[`Paper`](https://arxiv.org/abs/2306.02928) | [`Training Data`](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion) | [`Training Code`](https://github.com/Simon-Lepage/CondViT-LRVSF) | [`Demo`](https://huggingface.co/spaces/Slep/CondViT-LRVSF-Demo)

## General Information

This model is a CLIP ViT-B/16 finetuned on LAION-RVS-Fashion (LRVSF) at resolution 224x224. The conditioning categories are the following (a small validation sketch follows the list):

- Bags
- Feet
- Hands
- Head
- Lower Body
- Neck
- Outwear
- Upper Body
- Waist
- Whole Body
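
The processor presumably matches these category strings exactly, so a small guard against typos can save a debugging round trip. The `VALID_CATEGORIES` set and `check_category` helper below are an illustrative sketch, not part of the released API.

```python
# Illustrative helper, not part of the released API; the set simply mirrors
# the documented category list above.
VALID_CATEGORIES = {
    "Bags", "Feet", "Hands", "Head", "Lower Body",
    "Neck", "Outwear", "Upper Body", "Waist", "Whole Body",
}

def check_category(cat: str) -> str:
    """Fail fast on a typo instead of passing an unknown category to the processor."""
    if cat not in VALID_CATEGORIES:
        raise ValueError(f"Unknown category {cat!r}; expected one of {sorted(VALID_CATEGORIES)}")
    return cat
```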

Research use only.

## How to Use

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# CondViT is a custom architecture, so the checkpoint ships its own modeling
# code and must be loaded with trust_remote_code=True.
model = AutoModel.from_pretrained("Slep/CondViT-B16-cat", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-cat", trust_remote_code=True)

# Example image from the LAION-RVS-Fashion training data.
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
cat = "Bags"  # one of the conditioning categories listed above

# The processor prepares the pixel values and encodes the conditioning category.
inputs = processor(images=[img], categories=[cat])

# The model returns the raw conditional embedding; normalize it before
# computing cosine similarities.
raw_embedding = model(**inputs)
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)
```
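
A typical referred visual search use is to embed a gallery of images under their categories and rank them against a conditional query embedding. The sketch below continues the snippet above; the two-image gallery and the `gallery_images` / `gallery_categories` names are illustrative stand-ins for a real catalog.

```python
# Toy gallery reusing the image downloaded above; in practice these would be
# your catalog images and their conditioning categories.
gallery_images = [img, img]
gallery_categories = ["Bags", "Upper Body"]

with torch.no_grad():
    # Embed each gallery image, conditioned on its own category.
    gallery_inputs = processor(images=gallery_images, categories=gallery_categories)
    gallery_emb = torch.nn.functional.normalize(model(**gallery_inputs), dim=-1)

    # Embed the query image, conditioned on the category of interest.
    query_inputs = processor(images=[img], categories=["Bags"])
    query_emb = torch.nn.functional.normalize(model(**query_inputs), dim=-1)

# On unit-normalized embeddings, cosine similarity is a plain dot product.
scores = query_emb @ gallery_emb.T  # shape (1, num_gallery_images)
ranking = scores.argsort(dim=-1, descending=True)
```

Note that the same image conditioned on two different categories yields two different embeddings; each vector is meant to describe only the referred item.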