nolan4 committed · Commit 0647efc (verified) · 1 parent: b7d8382

Update README.md

Files changed (1): README.md (+9, -8)

README.md CHANGED

base_model:
- answerdotai/ModernBERT-base
- HuggingFaceTB/SmolVLM-Instruct
pipeline_tag: zero-shot-image-classification
---

# Model Card for ModernBERT-base-CLIP

Use natural language to search for images.<br>

# How to Get Started with the Model

To use a pretrained model to search through a directory of images, see demo.py. For training, see train.py.<br>
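
A minimal sketch of the kind of retrieval loop demo.py performs is shown below. The `encode_text` / `encode_image` helpers are hypothetical stand-ins for the model's 512-dimensional projected embeddings; this is not the actual script interface.

```python
# Sketch of natural-language image search (hypothetical helpers, not the
# actual demo.py interface).
from pathlib import Path

import torch
import torch.nn.functional as F
from PIL import Image


def search(query, image_dir, encode_text, encode_image, top_k=5):
    paths = sorted(Path(image_dir).glob("*.jpg"))
    # Embed every image and the query into the shared 512-d space.
    image_embs = torch.stack([encode_image(Image.open(p).convert("RGB")) for p in paths])
    text_emb = encode_text(query)
    # Cosine similarity = dot product of L2-normalized embeddings.
    sims = F.normalize(image_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    scores, idx = sims.topk(min(top_k, len(paths)))
    return [(str(paths[i]), s) for s, i in zip(scores.tolist(), idx.tolist())]
```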

# Model Details

**Text encoder:** ModernBERT-base<br>
https://huggingface.co/answerdotai/ModernBERT-base<br>
**Vision encoder:** IdeficsV3 variant extracted from Hugging Face's SmolVLM<br>
https://huggingface.co/blog/smolvlm<br>
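
As a rough illustration of how these two backbones could be loaded (a sketch only; the exact attribute path to SmolVLM's vision tower is an assumption, so check train.py for the real extraction):

```python
# Sketch: load the two backbones with transformers. The vision-tower
# attribute path below is an assumption, not confirmed by this repo.
from transformers import AutoModel, AutoTokenizer, AutoModelForVision2Seq, AutoProcessor

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
text_encoder = AutoModel.from_pretrained("answerdotai/ModernBERT-base")  # fine-tuned during training

smolvlm = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
vision_encoder = smolvlm.model.vision_model  # Idefics3-style vision tower (assumed path)

# The vision encoder stays frozen; only the text encoder and projections are trained.
for p in vision_encoder.parameters():
    p.requires_grad = False
```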

# Model Description

ModernBERT-base-CLIP is a multimodal model for Contrastive Language-Image Pretraining (CLIP), designed to align text and image representations in a shared embedding space.
It uses a fine-tuned ModernBERT-base text encoder and a frozen vision encoder (extracted from SmolVLM) to generate embeddings, which are projected into a 512-dimensional space with linear layers.
The model enables natural-language image retrieval and zero-shot classification by optimizing a contrastive loss that maximizes the similarity of matching text-image pairs while minimizing the similarity of non-matching pairs.
Training was conducted on the Flickr30k dataset, with one-shot evaluation performed on COCO images (or your own!) using the demo.py script.
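
A minimal sketch of the projection step described above, operating on pooled features from each encoder (the hidden sizes are assumptions: 768 for ModernBERT-base, and a placeholder width for the vision tower):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHeads(nn.Module):
    """Project pooled text/image features into the shared 512-d space."""

    def __init__(self, text_dim=768, vision_dim=1152, embed_dim=512):
        super().__init__()
        # vision_dim is an assumption; use the width of the extracted vision tower.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(vision_dim, embed_dim)

    def forward(self, text_feats, image_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        return t, v  # cosine similarities are then t @ v.T
```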

# Datasets

flickr30k (training): https://huggingface.co/datasets/nlphuji/flickr30k<br>
COCO captions (demo): https://cocodataset.org/#captions-2015<br>
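
For reference, the training data can be pulled from the Hub with the datasets library; split and column names are not restated here, so inspect the dataset card before relying on them:

```python
# Sketch: load Flickr30k from the Hub and inspect its splits/columns.
from datasets import load_dataset

flickr = load_dataset("nlphuji/flickr30k")
print(flickr)  # shows the available splits and their columns (images plus captions)
```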

# Training Procedure

The model is trained using the InfoNCE contrastive loss, which encourages positive (matching) text-image pairs to score higher than every other pairing in the batch.
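
A standard symmetric InfoNCE formulation is sketched below for reference; details such as the temperature value are assumptions and may differ from train.py.

```python
import torch
import torch.nn.functional as F


def info_nce(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (text, image) pairs.

    Both inputs are L2-normalized with shape (batch, 512); pair i is the
    positive for row/column i, and all other batch items act as negatives.
    """
    logits = text_emb @ image_emb.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)    # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)  # image -> text direction
    return (loss_t2i + loss_i2t) / 2
```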

# Hardware

Nvidia 3080 Ti