---
license: cc-by-nc-4.0
model-index:
- name: CondViT-B16-cat
  results:
  - dataset:
      name: LAION - Referred Visual Search - Fashion
      split: test
      type: Slep/LAION-RVS-Fashion
    metrics:
    - name: R@1 +10K Dist.
      type: recall_at_1|10000
      value: 93.44 ± 0.83
    - name: R@5 +10K Dist.
      type: recall_at_5|10000
      value: 98.07 ± 0.37
    - name: R@10 +10K Dist.
      type: recall_at_10|10000
      value: 98.69 ± 0.38
    - name: R@20 +10K Dist.
      type: recall_at_20|10000
      value: 98.98 ± 0.34
    - name: R@50 +10K Dist.
      type: recall_at_50|10000
      value: 99.55 ± 0.18
    - name: R@1 +100K Dist.
      type: recall_at_1|100000
      value: 85.90 ± 1.37
    - name: R@5 +100K Dist.
      type: recall_at_5|100000
      value: 94.22 ± 0.87
    - name: R@10 +100K Dist.
      type: recall_at_10|100000
      value: 96.04 ± 0.68
    - name: R@20 +100K Dist.
      type: recall_at_20|100000
      value: 97.18 ± 0.56
    - name: R@50 +100K Dist.
      type: recall_at_50|100000
      value: 98.28 ± 0.34
    - name: R@1 +500K Dist.
      type: recall_at_1|500000
      value: 78.19 ± 1.59
    - name: R@5 +500K Dist.
      type: recall_at_5|500000
      value: 88.70 ± 1.15
    - name: R@10 +500K Dist.
      type: recall_at_10|500000
      value: 91.46 ± 1.02
    - name: R@20 +500K Dist.
      type: recall_at_20|500000
      value: 94.07 ± 0.86
    - name: R@50 +500K Dist.
      type: recall_at_50|500000
      value: 96.11 ± 0.64
    - name: R@1 +1M Dist.
      type: recall_at_1|1000000
      value: 74.49 ± 1.23
    - name: R@5 +1M Dist.
      type: recall_at_5|1000000
      value: 85.38 ± 1.29
    - name: R@10 +1M Dist.
      type: recall_at_10|1000000
      value: 88.95 ± 1.15
    - name: R@20 +1M Dist.
      type: recall_at_20|1000000
      value: 91.35 ± 0.93
    - name: R@50 +1M Dist.
      type: recall_at_50|1000000
      value: 94.75 ± 0.75
    - name: Available Dists.
      type: n_dists
      value: 2000014
    - name: Embedding Dimension
      type: embedding_dim
      value: 512
    - name: Conditioning
      type: conditioning
      value: category
    source:
      name: LRVSF Leaderboard
      url: https://huggingface.co/spaces/Slep/LRVSF-Leaderboard
    task:
      type: Retrieval
tags:
- lrvsf-benchmark
datasets:
- Slep/LAION-RVS-Fashion
---
# Conditional ViT - B/16 - Categories

*Introduced in <a href=https://arxiv.org/abs/2306.02928>**LRVS-Fashion: Extending Visual Search with Referring Instructions**</a>, Lepage et al. 2023*
<div align="center">
<div id=links>

|Data|Code|Models|Spaces|
|:-:|:-:|:-:|:-:|
|[Full Dataset](https://huggingface.co/datasets/Slep/LAION-RVS-Fashion)|[Training Code](https://github.com/Simon-Lepage/CondViT-LRVSF)|[Categorical Model](https://huggingface.co/Slep/CondViT-B16-cat)|[LRVS-F Leaderboard](https://huggingface.co/spaces/Slep/LRVSF-Leaderboard)|
|[Test set](https://zenodo.org/doi/10.5281/zenodo.11189942)|[Benchmark Code](https://github.com/Simon-Lepage/LRVSF-Benchmark)|[Textual Model](https://huggingface.co/Slep/CondViT-B16-txt)|[Demo](https://huggingface.co/spaces/Slep/CondViT-LRVSF-Demo)|

</div>
</div>

## General Information
This model is fine-tuned from CLIP ViT-B/16 on LRVS-F at a resolution of 224x224. The conditioning categories are the following:

- Bags
- Feet
- Hands
- Head
- Lower Body
- Neck
- Outwear
- Upper Body
- Waist
- Whole Body

For research use only.
## How to Use

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the conditional ViT and its processor.
# trust_remote_code=True is assumed to be required here, since CondViT ships
# its own modeling code on the Hub rather than a native transformers architecture.
model = AutoModel.from_pretrained("Slep/CondViT-B16-cat", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Slep/CondViT-B16-cat", trust_remote_code=True)

# Example image from LAION-RVS-Fashion, conditioned on the "Bags" category.
url = "https://huggingface.co/datasets/Slep/LAION-RVS-Fashion/resolve/main/assets/108856.0.jpg"
img = Image.open(requests.get(url, stream=True).raw)
cat = "Bags"

# Preprocess the image together with its conditioning category, then embed it.
inputs = processor(images=[img], categories=[cat])
raw_embedding = model(**inputs)

# L2-normalize the embedding before computing cosine similarities.
normalized_embedding = torch.nn.functional.normalize(raw_embedding, dim=-1)
```
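The embeddings are meant for referred visual search: query and gallery images are embedded with a conditioning category and compared by cosine similarity. Below is a minimal retrieval sketch that continues the example above (it reuses `model`, `processor`, `img`, and `normalized_embedding`); the toy gallery and the batched `processor(images=..., categories=...)` call are illustrative assumptions rather than an official API.

```python
# Toy gallery: the demo image repeated with different conditioning categories.
# In practice, gallery_imgs / gallery_cats would come from your own catalog.
gallery_imgs = [img, img, img]
gallery_cats = ["Bags", "Upper Body", "Feet"]

with torch.no_grad():
    gallery_inputs = processor(images=gallery_imgs, categories=gallery_cats)
    gallery_emb = torch.nn.functional.normalize(model(**gallery_inputs), dim=-1)

# Rank gallery items by cosine similarity to the normalized query embedding.
scores = (normalized_embedding @ gallery_emb.T).squeeze(0)  # shape: (n_gallery,)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. gallery item {idx} - cosine similarity {scores[idx].item():.3f}")
```

At the scale of the benchmark's distractor sets, the same normalized embeddings would typically be stored in an approximate nearest-neighbor index rather than scored with a dense matrix product.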