CLIP Model based on DistilBERT and ViT

This repository contains a CLIP (Contrastive Language-Image Pretraining) model that combines the power of two state-of-the-art architectures:

  • DistilBERT (based on distilbert-base-uncased): A smaller, faster, and lighter version of BERT.
  • Vision Transformer (ViT) (based on google/vit-base-patch16-224): A powerful vision transformer architecture for image processing.

The model is trained to learn joint representations of images and text, enabling a variety of multimodal tasks such as image-text matching, zero-shot classification, and cross-modal retrieval.

Model Overview

CLIP combines a text encoder and an image encoder to map both images and texts into a shared embedding space. Trained contrastively on a large number of image-text pairs, the model can perform various downstream tasks without task-specific fine-tuning.
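As a rough illustration of this contrastive setup (a sketch only; the exact loss formulation, temperature, and batch construction used for this checkpoint are not documented here), CLIP-style training scores every image in a batch against every caption and applies a symmetric cross-entropy loss so that matching pairs score highest:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    image and text embeddings of shape (batch, dim)."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: logits[i, j] = similarity of image i with text j
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching image-text pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```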

Components:

  • Text Encoder: distilbert-base-uncased is used to encode the textual input into a dense vector.
  • Image Encoder: google/vit-base-patch16-224 processes image data by dividing images into patches and learning their contextual relationships.
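A minimal sketch of how these two encoders could be combined into a dual-encoder model (the projection dimension, pooling strategy, and module names below are assumptions for illustration, not the exact implementation behind this checkpoint):

```python
import torch.nn as nn
from transformers import AutoModel, ViTModel

class FashionCLIP(nn.Module):
    """Dual encoder: DistilBERT for text, ViT for images, each followed by a
    linear projection into a shared embedding space (embed_dim is assumed)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.text_proj = nn.Linear(self.text_encoder.config.dim, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # first-token representation
        return self.text_proj(cls)

    def encode_image(self, pixel_values):
        out = self.image_encoder(pixel_values=pixel_values)
        cls = out.last_hidden_state[:, 0]  # ViT [CLS] token
        return self.image_proj(cls)
```

At inference time, the outputs of encode_text and encode_image can be L2-normalized and compared with a dot product for image-text matching, zero-shot classification, or cross-modal retrieval.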

Future work:

Train on larger datasets and with more compute resources.
