---
license: other
license_name: apple-sample-code-license
license_link: LICENSE
---

A CLIP (Contrastive Language-Image Pre-training) model trained on DFN-5B. Data Filtering Networks (DFNs) are small networks used to automatically filter large pools of uncurated data. This model was trained on 5B images that were filtered from a pool of 43B uncurated image-text pairs (12.8B image-text pairs from CommonPool-12.8B plus 30B additional public image-text pairs).
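For intuition, here is a minimal sketch of the filtering step a DFN performs: a small CLIP-style scoring network rates each image-text pair by image-text similarity, and only pairs scoring above a threshold are kept. The scoring model (ViT-B-32), the 0.3 threshold, and the helper names below are illustrative assumptions, not the exact network or pipeline used to construct DFN-5B.

import torch
import torch.nn.functional as F
from open_clip import create_model_from_pretrained, get_tokenizer

# Illustrative scoring network: any small CLIP-style model can play this role.
scorer, preprocess = create_model_from_pretrained('ViT-B-32', pretrained='openai')
scorer_tokenizer = get_tokenizer('ViT-B-32')

def dfn_score(image, caption):
    """Cosine similarity between the image and caption embeddings (higher = better pair)."""
    with torch.no_grad():
        img = preprocess(image).unsqueeze(0)
        txt = scorer_tokenizer([caption])
        img_feat = F.normalize(scorer.encode_image(img), dim=-1)
        txt_feat = F.normalize(scorer.encode_text(txt), dim=-1)
    return (img_feat @ txt_feat.T).item()

def filter_pool(pairs, threshold=0.3):
    """Keep only the (PIL image, caption) pairs the filtering network scores above threshold."""
    return [(img, cap) for img, cap in pairs if dfn_score(img, cap) >= threshold]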

This model has been converted to PyTorch from the original JAX checkpoints produced with AXLearn (https://github.com/apple/axlearn). The weights are directly usable in OpenCLIP (image + text).

Model Details

  • Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
  • Dataset: DFN-5B
  • Papers:
      • Data Filtering Networks: https://arxiv.org/abs/2309.17425
  • Metrics (zero-shot evaluation; see the sketch after the table)
    Dataset                  Metric
    ImageNet 1k              0.8344
    Caltech-101              0.954935
    CIFAR-10                 0.9878
    CIFAR-100                0.9051
    CLEVR Counts             0.2966
    CLEVR Distance           0.2124
    Country211               0.343981
    Describable Textures     0.706383
    EuroSAT                  0.654815
    FGVC Aircraft            0.714055
    Food-101                 0.956792
    GTSRB                    0.677514
    ImageNet Sketch          0.727308
    ImageNet v2              0.773
    ImageNet-A               0.6988
    ImageNet-O               0.381
    ImageNet-R               0.929367
    KITTI Vehicle Distance   0.336146
    MNIST                    0.8579
    ObjectNet                0.681275
    Oxford Flowers-102       0.899534
    Oxford-IIIT Pet          0.965515
    Pascal VOC 2007          0.818309
    PatchCamelyon            0.653625
    Rendered SST2            0.546403
    RESISC45                 0.750476
    Stanford Cars            0.957592
    STL-10                   0.989
    SUN397                   0.769149
    SVHN                     0.676168
    Flickr                   0.8645
    MSCOCO                   0.631112
    WinoGAViL                0.556329
    iWildCam                 0.205549
    Camelyon17               0.705034
    FMoW                     0.207482
    Dollar Street            0.699766
    GeoDE                    0.928184
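All of the numbers above are zero-shot scores: the model is applied to each benchmark without task-specific fine-tuning. The sketch below shows the usual OpenCLIP recipe for zero-shot classification accuracy: build a text classifier from one embedded prompt per class name, then assign each image to the class with the most similar embedding. The class names, prompt template, and dataset interface are placeholders, not the evaluation harness that produced this table (which typically also averages over prompt ensembles).

import torch
import torch.nn.functional as F
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN5B-CLIP-ViT-H-14')
tokenizer = get_tokenizer('hf-hub:apple/DFN5B-CLIP-ViT-H-14')

# Placeholder label set; a real benchmark uses the dataset's own class names.
class_names = ["tench", "goldfish", "great white shark"]
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    text = tokenizer(prompts, context_length=model.context_length)
    classifier = F.normalize(model.encode_text(text), dim=-1)  # (num_classes, embed_dim)

def zero_shot_accuracy(dataset):
    """dataset yields (PIL image, integer label) pairs."""
    correct = total = 0
    for image, label in dataset:
        with torch.no_grad():
            feat = F.normalize(model.encode_image(preprocess(image).unsqueeze(0)), dim=-1)
        pred = (feat @ classifier.T).argmax(dim=-1).item()
        correct += int(pred == label)
        total += 1
    return correct / total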

Model Usage

With OpenCLIP

import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the model and its preprocessing transform from the Hugging Face Hub.
model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN5B-CLIP-ViT-H-14')
tokenizer = get_tokenizer('hf-hub:apple/DFN5B-CLIP-ViT-H-14')

# Fetch an example image and apply the model's preprocessing.
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Tokenize the candidate labels.
labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode both modalities and L2-normalize so dot products are cosine similarities.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # DFN models are standard contrastive CLIP models (model.logit_bias is None),
    # so label probabilities come from a softmax over the scaled cosine similarities.
    text_probs = (image_features @ text_features.T * model.logit_scale.exp()).softmax(dim=-1)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)

Citation

@article{fang2023data,
  title={Data Filtering Networks},
  author={Fang, Alex and Jose, Albin Madappally and Jain, Amit and Schmidt, Ludwig and Toshev, Alexander and Shankar, Vaishaal},
  journal={arXiv preprint arXiv:2309.17425},
  year={2023}
}