Update README.md
README.md CHANGED
@@ -5,6 +5,7 @@ datasets:
 base_model:
 - answerdotai/ModernBERT-base
 - HuggingFaceTB/SmolVLM-Instruct
+pipeline_tag: zero-shot-image-classification
 ---
 # Model Card for Model ID
 
@@ -14,25 +15,25 @@ Use natural language to search for images.<br>
 
 # How to Get Started with the Model
 
-To use a pretrained model to search a directory of images, go to demo.py. For training, see train.py.<br>
+To use a pretrained model to search through a directory of images, go to demo.py. For training, see train.py.<br>
 
 # Model Details
-**Text encoder
+**Text encoder:** ModernBERT-base<br>
 https://huggingface.co/answerdotai/ModernBERT-base<br>
-**Vision encoder
-https://huggingface.co/blog/smolvlm
+**Vision encoder:** Idefics3 variant extracted from HF's SmolVLM<br>
+https://huggingface.co/blog/smolvlm<br>
 
 # Model Description
 
 ModernBERT-base-CLIP is a multimodal model for Contrastive Language-Image Pretraining (CLIP), designed to align text and image representations in a shared embedding space.
-It leverages a fine-tuned ModernBERT-base text encoder and a frozen vision encoder (
+It leverages a fine-tuned ModernBERT-base text encoder and a frozen vision encoder (extracted from SmolVLM) to generate embeddings, which are projected into a 512-dimensional space using
 linear layers. The model enables natural language-based image retrieval and zero-shot classification by optimizing a contrastive loss, which maximizes the similarity between matching text-image pairs while minimizing the similarity for non-matching pairs.
 Training was conducted on the Flickr30k dataset, with one-shot evaluation performed on COCO images (... or your own!) using the demo.py script.
 
 # Datasets
 
-flickr30k: https://huggingface.co/datasets/nlphuji/
-Coco-captioning: https://cocodataset.org/#captions-2015 (demo)
+flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k (training)<br>
+COCO captioning: https://cocodataset.org/#captions-2015 (demo)<br>
 
 # Training Procedure
 
@@ -43,4 +44,4 @@ The model is trained using the InfoNCE contrastive loss, which encourages positi
 
 # Hardware
 
-Nvidia 3080 Ti
+Nvidia 3080 Ti
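
The projection setup described in the Model Description can be pictured as two small linear heads on top of the encoders, mapping their pooled features into the shared 512-dimensional space. The sketch below is illustrative only: the hidden sizes (768 for ModernBERT-base, 1152 for the SmolVLM vision tower) and every class and variable name are assumptions, not the repository's actual code.

```python
# Minimal sketch of the dual-encoder projection described in the model card.
# The encoders are treated as black boxes that return pooled feature vectors;
# only the projection heads and the similarity computation are shown.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPProjectionHead(nn.Module):
    """Linear layer mapping encoder features into the shared 512-d space."""
    def __init__(self, in_dim: int, proj_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, proj_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # L2-normalise so that dot products are cosine similarities.
        return F.normalize(self.proj(features), dim=-1)

# Assumed hidden sizes: 768 for ModernBERT-base, 1152 for the SmolVLM vision tower.
text_proj = CLIPProjectionHead(in_dim=768)
image_proj = CLIPProjectionHead(in_dim=1152)

# Stand-ins for pooled encoder outputs (a batch of 4 caption/image pairs).
text_features = torch.randn(4, 768)     # from the fine-tuned text encoder
image_features = torch.randn(4, 1152)   # from the frozen vision encoder

text_emb = text_proj(text_features)     # (4, 512)
image_emb = image_proj(image_features)  # (4, 512)

# Cosine-similarity matrix between every caption and every image;
# the diagonal holds the matching pairs.
similarity = text_emb @ image_emb.T     # (4, 4)
print(similarity.shape)
```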
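
The Training Procedure section refers to the InfoNCE contrastive loss. A common symmetric formulation, sketched here under the assumption of L2-normalised 512-d embeddings and a hypothetical temperature of 0.07, is cross-entropy over the pairwise similarity matrix in both directions: each caption should score highest against its own image, and vice versa.

```python
# Hedged sketch of a symmetric InfoNCE objective; the temperature value and
# function name are assumptions, not the repository's actual training code.
import torch
import torch.nn.functional as F

def info_nce_loss(text_emb: torch.Tensor,
                  image_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """text_emb, image_emb: L2-normalised (batch, 512) embeddings of matching pairs."""
    logits = text_emb @ image_emb.T / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Row i should match column i: pull matching caption/image pairs together,
    # push every other pairing in the batch apart.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.T, targets)
    return (loss_t2i + loss_i2t) / 2

# Toy check with random normalised embeddings standing in for model outputs.
t = F.normalize(torch.randn(8, 512), dim=-1)
v = F.normalize(torch.randn(8, 512), dim=-1)
print(info_nce_loss(t, v))
```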
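
Natural-language image search then reduces to ranking cached image embeddings by cosine similarity with a query embedding. The snippet below is a hypothetical illustration in the spirit of demo.py (its real interface may differ); using label prompts such as "a photo of a dog" as the queries turns the same ranking into zero-shot classification.

```python
# Hypothetical retrieval loop: names, shapes, and file paths are assumptions.
# Images are embedded once, then any text query is matched against the cache.
import torch
import torch.nn.functional as F

def rank_images(query_emb: torch.Tensor,
                image_embs: torch.Tensor,
                image_paths: list[str],
                top_k: int = 5) -> list[tuple[str, float]]:
    """query_emb: (512,), image_embs: (num_images, 512), both L2-normalised."""
    scores = image_embs @ query_emb                       # cosine similarities
    top = torch.topk(scores, k=min(top_k, len(image_paths)))
    return [(image_paths[i], scores[i].item()) for i in top.indices.tolist()]

# Toy example with random embeddings standing in for real model outputs.
paths = [f"img_{i}.jpg" for i in range(10)]
image_embs = F.normalize(torch.randn(10, 512), dim=-1)
query_emb = F.normalize(torch.randn(512), dim=-1)
print(rank_images(query_emb, image_embs, paths))
```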