---
tags:
- clip
- e-commerce
- fashion
- multimodal retrieval
- transformers.js
- transformers
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license: apache-2.0
language:
- en
metrics:
- precision
- recall
- MRR
---
[![GitHub](https://img.shields.io/badge/GitHub-black?logo=github)](https://github.com/marqo-ai/marqo-FashionCLIP)
# Marqo-FashionCLIP Model Card
Marqo-FashionCLIP and Marqo-FashionSigLIP outperform the previous state-of-the-art fashion CLIP models (see results below).

Marqo-FashionCLIP leverages Generalised Contrastive Learning ([GCL](https://www.marqo.ai/blog/generalized-contrastive-learning-for-multi-modal-retrieval-and-ranking)), which allows the model to be trained not only on text descriptions but also on categories, styles, colors, materials, keywords, and fine details, to provide highly relevant search results on fashion products.
The model was fine-tuned from ViT-B-16 (laion2b_s34b_b88k).

**GitHub Page**: [Marqo-FashionCLIP](https://github.com/marqo-ai/marqo-FashionCLIP)

**Blog**: [Marqo Blog](https://www.marqo.ai/blog/search-model-for-fashion)
## Usage
### Hugging Face
The model can be loaded with `AutoModel` as follows:
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the model and its processor (the repository ships custom code)
model = AutoModel.from_pretrained('Marqo/marqo-fashionCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('Marqo/marqo-fashionCLIP', trust_remote_code=True)

# One image and three candidate labels
image = [Image.open("docs/fashion-hippo.png")]
text = ["a hat", "a t-shirt", "shoes"]
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")

# Embed both modalities, then turn scaled similarities into probabilities
with torch.no_grad():
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [0.99990773, 0.00006382, 0.00002847]
```
### OpenCLIP
The model can be used with [OpenCLIP](https://github.com/mlfoundations/open_clip) as follows:
```python
import open_clip
import torch
from PIL import Image

# Load the model, its image transforms, and the tokenizer from the Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')

# One image (with a batch dimension) and three candidate labels
image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

# Embed both modalities, then turn scaled similarities into probabilities
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [0.9998498302475922, 0.000119267522939106, 0.000030902229468640687]
```
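
Both Python snippets above finish with the same step: the normalized image and text embeddings are compared by dot product (cosine similarity), the scores are multiplied by 100, and a softmax turns them into label probabilities. The sketch below reproduces only that arithmetic in plain Python. The 3-dimensional vectors are made up for illustration and are not model outputs, and the helper names (`l2_normalize`, `softmax`, `label_probs`) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(image_vec, text_vecs, scale=100.0):
    """Scaled cosine similarity between one image and several texts, then softmax."""
    image_vec = l2_normalize(image_vec)
    sims = [
        sum(a * b for a, b in zip(image_vec, l2_normalize(text_vec)))
        for text_vec in text_vecs
    ]
    return softmax([scale * s for s in sims])


# Toy embeddings (assumed values for illustration only)
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = label_probs(image_embedding, text_embeddings)
print(probs)
```

Because of the factor of 100, even a moderate gap in cosine similarity yields a near one-hot distribution, which is why the example outputs above are so close to 1 for the best label.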
### Transformers.js
You can also run the model in JavaScript with the [Transformers.js](https://huggingface.co/docs/transformers.js) library.
First, install it from [NPM](https://www.npmjs.com/package/@huggingface/transformers) using:
```bash
npm i @huggingface/transformers
```
Then, compute embeddings as follows:
```js
import { CLIPTextModelWithProjection, CLIPVisionModelWithProjection, AutoTokenizer, AutoProcessor, RawImage, softmax, dot } from '@huggingface/transformers';
const model_id = 'Marqo/marqo-fashionCLIP';
// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);
// Load processor and vision model
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await CLIPVisionModelWithProjection.from_pretrained(model_id);
// Run tokenization
const texts = ['a hat', 'a t-shirt', 'shoes'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);
// Read image and run processor
const image = await RawImage.read('https://raw.githubusercontent.com/marqo-ai/marqo-FashionCLIP/main/docs/fashion-hippo.png');
const image_inputs = await processor(image);
// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);
// Compute similarity scores
const normalized_text_embeds = text_embeds.normalize().tolist();
const normalized_image_embeds = image_embeds.normalize().tolist()[0];
const text_probs = softmax(normalized_text_embeds.map((text_embed) =>
    100.0 * dot(normalized_image_embeds, text_embed)
));
console.log(text_probs);
// [0.9998498302475922, 0.000119267522939106, 0.000030902229468640687]
```
## Benchmark Results
Average evaluation results on 6 public multimodal fashion datasets ([Atlas](https://huggingface.co/datasets/Marqo/atlas), [DeepFashion (In-shop)](https://huggingface.co/datasets/Marqo/deepfashion-inshop), [DeepFashion (Multimodal)](https://huggingface.co/datasets/Marqo/deepfashion-multimodal), [Fashion200k](https://huggingface.co/datasets/Marqo/fashion200k), [KAGL](https://huggingface.co/datasets/Marqo/KAGL), and [Polyvore](https://huggingface.co/datasets/Marqo/polyvore)) are reported below:

**Text-To-Image (Averaged across 6 datasets)**

| Model                      | AvgRecall   | Recall@1   | Recall@10   | MRR       |
|----------------------------|-------------|------------|-------------|-----------|
| Marqo-FashionCLIP | **0.192** | **0.094** | **0.290** | **0.200** |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174       | 0.088      | 0.261       | 0.180     |

**Category-To-Product (Averaged across 5 datasets)**

| Model                      | AvgP      | P@1       | P@10      | MRR       |
|----------------------------|-----------|-----------|-----------|-----------|
| Marqo-FashionCLIP | **0.705** | **0.734** | 0.676 | **0.776** |
| FashionCLIP2.0 | 0.684 | 0.681 | **0.686** | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662     | 0.673     | 0.652     | 0.743     |

**Sub-Category-To-Product (Averaged across 4 datasets)**

| Model                      | AvgP      | P@1       | P@10      | MRR       |
|----------------------------|-----------|-----------|-----------|-----------|
| Marqo-FashionCLIP | **0.707** | **0.747** | **0.667** | **0.772** |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638     | 0.651     | 0.624     | 0.712     |
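
For readers unfamiliar with the column names, the sketch below computes Recall@k, P@k (precision at k), and MRR (mean reciprocal rank) from ranked result lists, using their standard textbook definitions. It is an illustration only: the exact evaluation protocol behind the tables is defined in the GitHub repository, the queries and document IDs are hypothetical, and the function names are not taken from the Marqo code. The AvgRecall and AvgP columns appear consistent, up to rounding, with the mean of the @1 and @10 columns, but that reading is an inference from the numbers rather than a documented definition.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical ranked results and ground truth for two queries
queries = [
    {"ranked": ["a", "b", "c", "d"], "relevant": ["b"]},
    {"ranked": ["x", "y", "z", "w"], "relevant": ["x", "w"]},
]

recall_1 = mean(recall_at_k(q["ranked"], q["relevant"], 1) for q in queries)
precision_1 = mean(precision_at_k(q["ranked"], q["relevant"], 1) for q in queries)
precision_2 = mean(precision_at_k(q["ranked"], q["relevant"], 2) for q in queries)
mrr = mean(reciprocal_rank(q["ranked"], q["relevant"]) for q in queries)

print(f"Recall@1={recall_1:.3f} P@1={precision_1:.3f} P@2={precision_2:.3f} MRR={mrr:.3f}")
```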