init

This view is limited to 50 files because it contains too many changes; see the raw diff for the full change set.
- CLIP/.github/workflows/test.yml +33 -0
- CLIP/.gitignore +10 -0
- CLIP/CLIP.png +0 -0
- CLIP/LICENSE +22 -0
- CLIP/MANIFEST.in +1 -0
- CLIP/README.md +199 -0
- CLIP/clip/__init__.py +1 -0
- CLIP/clip/bpe_simple_vocab_16e6.txt.gz +3 -0
- CLIP/clip/clip.py +245 -0
- CLIP/clip/model.py +436 -0
- CLIP/clip/simple_tokenizer.py +132 -0
- CLIP/data/country211.md +12 -0
- CLIP/data/prompts.md +3401 -0
- CLIP/data/rendered-sst2.md +11 -0
- CLIP/data/yfcc100m.md +14 -0
- CLIP/hubconf.py +42 -0
- CLIP/model-card.md +120 -0
- CLIP/notebooks/Interacting_with_CLIP.ipynb +0 -0
- CLIP/notebooks/Prompt_Engineering_for_ImageNet.ipynb +1107 -0
- CLIP/requirements.txt +5 -0
- CLIP/setup.py +21 -0
- CLIP/tests/test_consistency.py +25 -0
- Dockerfile +6 -0
- Dockerfile~ +6 -0
- LICENSE +33 -0
- README.md +10 -12
- app.py +33 -141
- configs/polos-trainer.yaml +46 -0
- docker.sh +4 -0
- install.sh +4 -0
- pacscore/README.md +135 -0
- pacscore/compute_correlations.py +111 -0
- pacscore/compute_metrics.py +110 -0
- pacscore/data/__init__.py +6 -0
- pacscore/data/dataset.py +55 -0
- pacscore/data/tokenizer/__init__.py +0 -0
- pacscore/data/tokenizer/bpe_simple_vocab_16e6.txt.gz +3 -0
- pacscore/data/tokenizer/simple_tokenizer.py +144 -0
- pacscore/environment.yml +92 -0
- pacscore/evaluation/__init__.py +44 -0
- pacscore/evaluation/pac_score/__init__.py +1 -0
- pacscore/evaluation/pac_score/pac_score.py +133 -0
- pacscore/evaluation/tokenizer.py +63 -0
- pacscore/example/bad_captions.json +5 -0
- pacscore/example/good_captions.json +4 -0
- pacscore/example/images/image1.jpg +0 -0
- pacscore/example/images/image2.jpg +0 -0
- pacscore/example/refs.json +18 -0
- pacscore/images/model.png +0 -0
- pacscore/models/__init__.py +0 -0
CLIP/.github/workflows/test.yml
ADDED
@@ -0,0 +1,33 @@
+name: test
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+jobs:
+  CLIP-test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: [3.8]
+        pytorch-version: [1.7.1, 1.9.1, 1.10.1]
+        include:
+          - python-version: 3.8
+            pytorch-version: 1.7.1
+            torchvision-version: 0.8.2
+          - python-version: 3.8
+            pytorch-version: 1.9.1
+            torchvision-version: 0.10.1
+          - python-version: 3.8
+            pytorch-version: 1.10.1
+            torchvision-version: 0.11.2
+    steps:
+      - uses: conda-incubator/setup-miniconda@v2
+      - run: conda install -n test python=${{ matrix.python-version }} pytorch=${{ matrix.pytorch-version }} torchvision=${{ matrix.torchvision-version }} cpuonly -c pytorch
+      - uses: actions/checkout@v2
+      - run: echo "$CONDA/envs/test/bin" >> $GITHUB_PATH
+      - run: pip install pytest
+      - run: pip install .
+      - run: pytest
CLIP/.gitignore
ADDED
@@ -0,0 +1,10 @@
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info
+.pytest_cache
+.ipynb_checkpoints
+
+thumbs.db
+.DS_Store
+.idea
CLIP/CLIP.png
ADDED
CLIP/LICENSE
ADDED
@@ -0,0 +1,22 @@
+MIT License
+
+Copyright (c) 2021 OpenAI
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
CLIP/MANIFEST.in
ADDED
@@ -0,0 +1 @@
+include clip/bpe_simple_vocab_16e6.txt.gz
CLIP/README.md
ADDED
@@ -0,0 +1,199 @@
+# CLIP
+
+[[Blog]](https://openai.com/blog/clip/) [[Paper]](https://arxiv.org/abs/2103.00020) [[Model Card]](model-card.md) [[Colab]](https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb)
+
+CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
+
+
+
+## Approach
+
+![CLIP](CLIP.png)
+
+
+
+## Usage
+
+First, [install PyTorch 1.7.1](https://pytorch.org/get-started/locally/) (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:
+
+```bash
+$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
+$ pip install ftfy regex tqdm
+$ pip install git+https://github.com/openai/CLIP.git
+```
+
+Replace `cudatoolkit=11.0` above with the appropriate CUDA version on your machine or `cpuonly` when installing on a machine without a GPU.
+
+```python
+import torch
+import clip
+from PIL import Image
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model, preprocess = clip.load("ViT-B/32", device=device)
+
+image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
+text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
+
+with torch.no_grad():
+    image_features = model.encode_image(image)
+    text_features = model.encode_text(text)
+
+    logits_per_image, logits_per_text = model(image, text)
+    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
+
+print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
+```
+
+
+## API
+
+The CLIP module `clip` provides the following methods:
+
+#### `clip.available_models()`
+
+Returns the names of the available CLIP models.
+
+#### `clip.load(name, device=..., jit=False)`
+
+Returns the model and the TorchVision transform needed by the model, specified by the model name returned by `clip.available_models()`. It will download the model as necessary. The `name` argument can also be a path to a local checkpoint.
+
+The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When `jit` is `False`, a non-JIT version of the model will be loaded.
+
+#### `clip.tokenize(text: Union[str, List[str]], context_length=77)`
+
+Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model
+
+---
+
+The model returned by `clip.load()` supports the following methods:
+
+#### `model.encode_image(image: Tensor)`
+
+Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.
+
+#### `model.encode_text(text: Tensor)`
+
+Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.
+
+#### `model(image: Tensor, text: Tensor)`
+
+Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.
+
+
+
+## More Examples
+
+### Zero-Shot Prediction
+
+The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the [CIFAR-100 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), and predicts the most likely labels among the 100 textual labels from the dataset.
+
+```python
+import os
+import clip
+import torch
+from torchvision.datasets import CIFAR100
+
+# Load the model
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model, preprocess = clip.load('ViT-B/32', device)
+
+# Download the dataset
+cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
+
+# Prepare the inputs
+image, class_id = cifar100[3637]
+image_input = preprocess(image).unsqueeze(0).to(device)
+text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
+
+# Calculate features
+with torch.no_grad():
+    image_features = model.encode_image(image_input)
+    text_features = model.encode_text(text_inputs)
+
+# Pick the top 5 most similar labels for the image
+image_features /= image_features.norm(dim=-1, keepdim=True)
+text_features /= text_features.norm(dim=-1, keepdim=True)
+similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+values, indices = similarity[0].topk(5)
+
+# Print the result
+print("\nTop predictions:\n")
+for value, index in zip(values, indices):
+    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
+```
+
+The output will look like the following (the exact numbers may be slightly different depending on the compute device):
+
+```
+Top predictions:
+
+           snake: 65.31%
+          turtle: 12.29%
+    sweet_pepper: 3.83%
+          lizard: 1.88%
+       crocodile: 1.75%
+```
+
+Note that this example uses the `encode_image()` and `encode_text()` methods that return the encoded features of given inputs.
+
+
+### Linear-probe evaluation
+
+The example below uses [scikit-learn](https://scikit-learn.org/) to perform logistic regression on image features.
+
+```python
+import os
+import clip
+import torch
+
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+from torch.utils.data import DataLoader
+from torchvision.datasets import CIFAR100
+from tqdm import tqdm
+
+# Load the model
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model, preprocess = clip.load('ViT-B/32', device)
+
+# Load the dataset
+root = os.path.expanduser("~/.cache")
+train = CIFAR100(root, download=True, train=True, transform=preprocess)
+test = CIFAR100(root, download=True, train=False, transform=preprocess)
+
+
+def get_features(dataset):
+    all_features = []
+    all_labels = []
+
+    with torch.no_grad():
+        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
+            features = model.encode_image(images.to(device))
+
+            all_features.append(features)
+            all_labels.append(labels)
+
+    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()
+
+# Calculate the image features
+train_features, train_labels = get_features(train)
+test_features, test_labels = get_features(test)
+
+# Perform logistic regression
+classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
+classifier.fit(train_features, train_labels)
+
+# Evaluate using the logistic regression classifier
+predictions = classifier.predict(test_features)
+accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
+print(f"Accuracy = {accuracy:.3f}")
+```
+
+Note that the `C` value should be determined via a hyperparameter sweep using a validation split.
+
+
+## See Also
+
+* [OpenCLIP](https://github.com/mlfoundations/open_clip): includes larger and independently trained CLIP models up to ViT-G/14
+* [Hugging Face implementation of CLIP](https://huggingface.co/docs/transformers/model_doc/clip): for easier integration with the HF ecosystem
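
The API section above states that `model(image, text)` returns cosine similarities between image and text features scaled by the learned temperature (roughly 100 for the released checkpoints). As a quick sanity check, the sketch below (not part of this commit; it assumes the `clip` package from this diff is installed and that `CLIP.png` is available, as in the Usage example) recomputes those logits by hand from `encode_image()` and `encode_text()`:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize the features, then scale the cosine similarities by the learned temperature.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    manual_logits = model.logit_scale.exp() * image_features @ text_features.t()

    logits_per_image, _ = model(image, text)

# Both paths perform the same computation, so they should agree closely (fp16 rounding aside).
print(torch.allclose(manual_logits.float(), logits_per_image.float(), atol=1e-2))
```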
CLIP/clip/__init__.py
ADDED
@@ -0,0 +1 @@
+from .clip import *
CLIP/clip/bpe_simple_vocab_16e6.txt.gz
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
+size 1356917
CLIP/clip/clip.py
ADDED
@@ -0,0 +1,245 @@
+import hashlib
+import os
+import urllib
+import warnings
+from typing import Any, Union, List
+from pkg_resources import packaging
+
+import torch
+from PIL import Image
+from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
+from tqdm import tqdm
+
+from .model import build_model
+from .simple_tokenizer import SimpleTokenizer as _Tokenizer
+
+try:
+    from torchvision.transforms import InterpolationMode
+    BICUBIC = InterpolationMode.BICUBIC
+except ImportError:
+    BICUBIC = Image.BICUBIC
+
+
+if packaging.version.parse(torch.__version__) < packaging.version.parse("1.7.1"):
+    warnings.warn("PyTorch version 1.7.1 or higher is recommended")
+
+
+__all__ = ["available_models", "load", "tokenize"]
+_tokenizer = _Tokenizer()
+
+_MODELS = {
+    "RN50": "https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt",
+    "RN101": "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt",
+    "RN50x4": "https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt",
+    "RN50x16": "https://openaipublic.azureedge.net/clip/models/52378b407f34354e150460fe41077663dd5b39c54cd0bfd2b27167a4a06ec9aa/RN50x16.pt",
+    "RN50x64": "https://openaipublic.azureedge.net/clip/models/be1cfb55d75a9666199fb2206c106743da0f6468c9d327f3e0d0a543a9919d9c/RN50x64.pt",
+    "ViT-B/32": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
+    "ViT-B/16": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
+    "ViT-L/14": "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt",
+    "ViT-L/14@336px": "https://openaipublic.azureedge.net/clip/models/3035c92b350959924f9f00213499208652fc7ea050643e8b385c2dac08641f02/ViT-L-14-336px.pt",
+}
+
+
+def _download(url: str, root: str):
+    os.makedirs(root, exist_ok=True)
+    filename = os.path.basename(url)
+
+    expected_sha256 = url.split("/")[-2]
+    download_target = os.path.join(root, filename)
+
+    if os.path.exists(download_target) and not os.path.isfile(download_target):
+        raise RuntimeError(f"{download_target} exists and is not a regular file")
+
+    if os.path.isfile(download_target):
+        if hashlib.sha256(open(download_target, "rb").read()).hexdigest() == expected_sha256:
+            return download_target
+        else:
+            warnings.warn(f"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file")
+
+    with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
+        with tqdm(total=int(source.info().get("Content-Length")), ncols=80, unit='iB', unit_scale=True, unit_divisor=1024) as loop:
+            while True:
+                buffer = source.read(8192)
+                if not buffer:
+                    break
+
+                output.write(buffer)
+                loop.update(len(buffer))
+
+    if hashlib.sha256(open(download_target, "rb").read()).hexdigest() != expected_sha256:
+        raise RuntimeError("Model has been downloaded but the SHA256 checksum does not not match")
+
+    return download_target
+
+
+def _convert_image_to_rgb(image):
+    return image.convert("RGB")
+
+
+def _transform(n_px):
+    return Compose([
+        Resize(n_px, interpolation=BICUBIC),
+        CenterCrop(n_px),
+        _convert_image_to_rgb,
+        ToTensor(),
+        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
+    ])
+
+
+def available_models() -> List[str]:
+    """Returns the names of available CLIP models"""
+    return list(_MODELS.keys())
+
+
+def load(name: str, device: Union[str, torch.device] = "cuda" if torch.cuda.is_available() else "cpu", jit: bool = False, download_root: str = None):
+    """Load a CLIP model
+
+    Parameters
+    ----------
+    name : str
+        A model name listed by `clip.available_models()`, or the path to a model checkpoint containing the state_dict
+
+    device : Union[str, torch.device]
+        The device to put the loaded model
+
+    jit : bool
+        Whether to load the optimized JIT model or more hackable non-JIT model (default).
+
+    download_root: str
+        path to download the model files; by default, it uses "~/.cache/clip"
+
+    Returns
+    -------
+    model : torch.nn.Module
+        The CLIP model
+
+    preprocess : Callable[[PIL.Image], torch.Tensor]
+        A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
+    """
+    if name in _MODELS:
+        model_path = _download(_MODELS[name], download_root or os.path.expanduser("~/.cache/clip"))
+    elif os.path.isfile(name):
+        model_path = name
+    else:
+        raise RuntimeError(f"Model {name} not found; available models = {available_models()}")
+
+    with open(model_path, 'rb') as opened_file:
+        try:
+            # loading JIT archive
+            model = torch.jit.load(opened_file, map_location=device if jit else "cpu").eval()
+            state_dict = None
+        except RuntimeError:
+            # loading saved state dict
+            if jit:
+                warnings.warn(f"File {model_path} is not a JIT archive. Loading as a state dict instead")
+                jit = False
+            state_dict = torch.load(opened_file, map_location="cpu")
+
+    if not jit:
+        model = build_model(state_dict or model.state_dict()).to(device)
+        if str(device) == "cpu":
+            model.float()
+        return model, _transform(model.visual.input_resolution)
+
+    # patch the device names
+    device_holder = torch.jit.trace(lambda: torch.ones([]).to(torch.device(device)), example_inputs=[])
+    device_node = [n for n in device_holder.graph.findAllNodes("prim::Constant") if "Device" in repr(n)][-1]
+
+    def _node_get(node: torch._C.Node, key: str):
+        """Gets attributes of a node which is polymorphic over return type.
+
+        From https://github.com/pytorch/pytorch/pull/82628
+        """
+        sel = node.kindOf(key)
+        return getattr(node, sel)(key)
+
+    def patch_device(module):
+        try:
+            graphs = [module.graph] if hasattr(module, "graph") else []
+        except RuntimeError:
+            graphs = []
+
+        if hasattr(module, "forward1"):
+            graphs.append(module.forward1.graph)
+
+        for graph in graphs:
+            for node in graph.findAllNodes("prim::Constant"):
+                if "value" in node.attributeNames() and str(_node_get(node, "value")).startswith("cuda"):
+                    node.copyAttributes(device_node)
+
+    model.apply(patch_device)
+    patch_device(model.encode_image)
+    patch_device(model.encode_text)
+
+    # patch dtype to float32 on CPU
+    if str(device) == "cpu":
+        float_holder = torch.jit.trace(lambda: torch.ones([]).float(), example_inputs=[])
+        float_input = list(float_holder.graph.findNode("aten::to").inputs())[1]
+        float_node = float_input.node()
+
+        def patch_float(module):
+            try:
+                graphs = [module.graph] if hasattr(module, "graph") else []
+            except RuntimeError:
+                graphs = []
+
+            if hasattr(module, "forward1"):
+                graphs.append(module.forward1.graph)
+
+            for graph in graphs:
+                for node in graph.findAllNodes("aten::to"):
+                    inputs = list(node.inputs())
+                    for i in [1, 2]:  # dtype can be the second or third argument to aten::to()
+                        if _node_get(inputs[i].node(), "value") == 5:
+                            inputs[i].node().copyAttributes(float_node)
+
+        model.apply(patch_float)
+        patch_float(model.encode_image)
+        patch_float(model.encode_text)
+
+        model.float()
+
+    return model, _transform(model.input_resolution.item())
+
+
+def tokenize(texts: Union[str, List[str]], context_length: int = 77, truncate: bool = False) -> Union[torch.IntTensor, torch.LongTensor]:
+    """
+    Returns the tokenized representation of given input string(s)
+
+    Parameters
+    ----------
+    texts : Union[str, List[str]]
+        An input string or a list of input strings to tokenize
+
+    context_length : int
+        The context length to use; all CLIP models use 77 as the context length
+
+    truncate: bool
+        Whether to truncate the text in case its encoding is longer than the context length
+
+    Returns
+    -------
+    A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length].
+    We return LongTensor when torch version is <1.8.0, since older index_select requires indices to be long.
+    """
+    if isinstance(texts, str):
+        texts = [texts]
+
+    sot_token = _tokenizer.encoder["<|startoftext|>"]
+    eot_token = _tokenizer.encoder["<|endoftext|>"]
+    all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
+    if packaging.version.parse(torch.__version__) < packaging.version.parse("1.8.0"):
+        result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)
+    else:
+        result = torch.zeros(len(all_tokens), context_length, dtype=torch.int)
+
+    for i, tokens in enumerate(all_tokens):
+        if len(tokens) > context_length:
+            if truncate:
+                tokens = tokens[:context_length]
+                tokens[-1] = eot_token
+            else:
+                raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
+        result[i, :len(tokens)] = torch.tensor(tokens)
+
+    return result
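
A small usage sketch for the `clip.tokenize()` function defined above (not part of the diff; it assumes the package from this commit is installed): each input is wrapped in `<|startoftext|>` / `<|endoftext|>` tokens and zero-padded to the 77-token context length, and over-long inputs either raise a `RuntimeError` or are clipped when `truncate=True`.

```python
import clip

tokens = clip.tokenize(["a diagram", "a dog", "a cat"])
print(tokens.shape)   # torch.Size([3, 77]); each row is zero-padded to the context length
print(tokens[0, :6])  # start-of-text token, BPE tokens for "a diagram", end-of-text token, padding

# Without truncate=True this would raise RuntimeError because the encoding exceeds 77 tokens.
long_text = "a photo of " + "a very, very " * 60 + "long caption"
clipped = clip.tokenize(long_text, truncate=True)
print(clipped.shape)  # torch.Size([1, 77]); the final token is forced to be <|endoftext|>
```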
CLIP/clip/model.py
ADDED
@@ -0,0 +1,436 @@
+from collections import OrderedDict
+from typing import Tuple, Union
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+from torch import nn
+
+
+class Bottleneck(nn.Module):
+    expansion = 4
+
+    def __init__(self, inplanes, planes, stride=1):
+        super().__init__()
+
+        # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
+        self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
+        self.bn1 = nn.BatchNorm2d(planes)
+        self.relu1 = nn.ReLU(inplace=True)
+
+        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
+        self.bn2 = nn.BatchNorm2d(planes)
+        self.relu2 = nn.ReLU(inplace=True)
+
+        self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
+
+        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
+        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
+        self.relu3 = nn.ReLU(inplace=True)
+
+        self.downsample = None
+        self.stride = stride
+
+        if stride > 1 or inplanes != planes * Bottleneck.expansion:
+            # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
+            self.downsample = nn.Sequential(OrderedDict([
+                ("-1", nn.AvgPool2d(stride)),
+                ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
+                ("1", nn.BatchNorm2d(planes * self.expansion))
+            ]))
+
+    def forward(self, x: torch.Tensor):
+        identity = x
+
+        out = self.relu1(self.bn1(self.conv1(x)))
+        out = self.relu2(self.bn2(self.conv2(out)))
+        out = self.avgpool(out)
+        out = self.bn3(self.conv3(out))
+
+        if self.downsample is not None:
+            identity = self.downsample(x)
+
+        out += identity
+        out = self.relu3(out)
+        return out
+
+
+class AttentionPool2d(nn.Module):
+    def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
+        super().__init__()
+        self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
+        self.k_proj = nn.Linear(embed_dim, embed_dim)
+        self.q_proj = nn.Linear(embed_dim, embed_dim)
+        self.v_proj = nn.Linear(embed_dim, embed_dim)
+        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
+        self.num_heads = num_heads
+
+    def forward(self, x):
+        x = x.flatten(start_dim=2).permute(2, 0, 1)  # NCHW -> (HW)NC
+        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
+        x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
+        x, _ = F.multi_head_attention_forward(
+            query=x[:1], key=x, value=x,
+            embed_dim_to_check=x.shape[-1],
+            num_heads=self.num_heads,
+            q_proj_weight=self.q_proj.weight,
+            k_proj_weight=self.k_proj.weight,
+            v_proj_weight=self.v_proj.weight,
+            in_proj_weight=None,
+            in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
+            bias_k=None,
+            bias_v=None,
+            add_zero_attn=False,
+            dropout_p=0,
+            out_proj_weight=self.c_proj.weight,
+            out_proj_bias=self.c_proj.bias,
+            use_separate_proj_weight=True,
+            training=self.training,
+            need_weights=False
+        )
+        return x.squeeze(0)
+
+
+class ModifiedResNet(nn.Module):
+    """
+    A ResNet class that is similar to torchvision's but contains the following changes:
+    - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
+    - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
+    - The final pooling layer is a QKV attention instead of an average pool
+    """
+
+    def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
+        super().__init__()
+        self.output_dim = output_dim
+        self.input_resolution = input_resolution
+
+        # the 3-layer stem
+        self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
+        self.bn1 = nn.BatchNorm2d(width // 2)
+        self.relu1 = nn.ReLU(inplace=True)
+        self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
+        self.bn2 = nn.BatchNorm2d(width // 2)
+        self.relu2 = nn.ReLU(inplace=True)
+        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
+        self.bn3 = nn.BatchNorm2d(width)
+        self.relu3 = nn.ReLU(inplace=True)
+        self.avgpool = nn.AvgPool2d(2)
+
+        # residual layers
+        self._inplanes = width  # this is a *mutable* variable used during construction
+        self.layer1 = self._make_layer(width, layers[0])
+        self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
+        self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
+        self.layer4 = self._make_layer(width * 8, layers[3], stride=2)
+
+        embed_dim = width * 32  # the ResNet feature dimension
+        self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)
+
+    def _make_layer(self, planes, blocks, stride=1):
+        layers = [Bottleneck(self._inplanes, planes, stride)]
+
+        self._inplanes = planes * Bottleneck.expansion
+        for _ in range(1, blocks):
+            layers.append(Bottleneck(self._inplanes, planes))
+
+        return nn.Sequential(*layers)
+
+    def forward(self, x):
+        def stem(x):
+            x = self.relu1(self.bn1(self.conv1(x)))
+            x = self.relu2(self.bn2(self.conv2(x)))
+            x = self.relu3(self.bn3(self.conv3(x)))
+            x = self.avgpool(x)
+            return x
+
+        x = x.type(self.conv1.weight.dtype)
+        x = stem(x)
+        x = self.layer1(x)
+        x = self.layer2(x)
+        x = self.layer3(x)
+        x = self.layer4(x)
+        x = self.attnpool(x)
+
+        return x
+
+
+class LayerNorm(nn.LayerNorm):
+    """Subclass torch's LayerNorm to handle fp16."""
+
+    def forward(self, x: torch.Tensor):
+        orig_type = x.dtype
+        ret = super().forward(x.type(torch.float32))
+        return ret.type(orig_type)
+
+
+class QuickGELU(nn.Module):
+    def forward(self, x: torch.Tensor):
+        return x * torch.sigmoid(1.702 * x)
+
+
+class ResidualAttentionBlock(nn.Module):
+    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
+        super().__init__()
+
+        self.attn = nn.MultiheadAttention(d_model, n_head)
+        self.ln_1 = LayerNorm(d_model)
+        self.mlp = nn.Sequential(OrderedDict([
+            ("c_fc", nn.Linear(d_model, d_model * 4)),
+            ("gelu", QuickGELU()),
+            ("c_proj", nn.Linear(d_model * 4, d_model))
+        ]))
+        self.ln_2 = LayerNorm(d_model)
+        self.attn_mask = attn_mask
+
+    def attention(self, x: torch.Tensor):
+        self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
+        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
+
+    def forward(self, x: torch.Tensor):
+        x = x + self.attention(self.ln_1(x))
+        x = x + self.mlp(self.ln_2(x))
+        return x
+
+
+class Transformer(nn.Module):
+    def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
+        super().__init__()
+        self.width = width
+        self.layers = layers
+        self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])
+
+    def forward(self, x: torch.Tensor):
+        return self.resblocks(x)
+
+
+class VisionTransformer(nn.Module):
+    def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
+        super().__init__()
+        self.input_resolution = input_resolution
+        self.output_dim = output_dim
+        self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)
+
+        scale = width ** -0.5
+        self.class_embedding = nn.Parameter(scale * torch.randn(width))
+        self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
+        self.ln_pre = LayerNorm(width)
+
+        self.transformer = Transformer(width, layers, heads)
+
+        self.ln_post = LayerNorm(width)
+        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))
+
+    def forward(self, x: torch.Tensor):
+        x = self.conv1(x)  # shape = [*, width, grid, grid]
+        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
+        x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
+        x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]
+        x = x + self.positional_embedding.to(x.dtype)
+        x = self.ln_pre(x)
+
+        x = x.permute(1, 0, 2)  # NLD -> LND
+        x = self.transformer(x)
+        x = x.permute(1, 0, 2)  # LND -> NLD
+
+        x = self.ln_post(x[:, 0, :])
+
+        if self.proj is not None:
+            x = x @ self.proj
+
+        return x
+
+
+class CLIP(nn.Module):
+    def __init__(self,
+                 embed_dim: int,
+                 # vision
+                 image_resolution: int,
+                 vision_layers: Union[Tuple[int, int, int, int], int],
+                 vision_width: int,
+                 vision_patch_size: int,
+                 # text
+                 context_length: int,
+                 vocab_size: int,
+                 transformer_width: int,
+                 transformer_heads: int,
+                 transformer_layers: int
+                 ):
+        super().__init__()
+
+        self.context_length = context_length
+
+        if isinstance(vision_layers, (tuple, list)):
+            vision_heads = vision_width * 32 // 64
+            self.visual = ModifiedResNet(
+                layers=vision_layers,
+                output_dim=embed_dim,
+                heads=vision_heads,
+                input_resolution=image_resolution,
+                width=vision_width
+            )
+        else:
+            vision_heads = vision_width // 64
+            self.visual = VisionTransformer(
+                input_resolution=image_resolution,
+                patch_size=vision_patch_size,
+                width=vision_width,
+                layers=vision_layers,
+                heads=vision_heads,
+                output_dim=embed_dim
+            )
+
+        self.transformer = Transformer(
+            width=transformer_width,
+            layers=transformer_layers,
+            heads=transformer_heads,
+            attn_mask=self.build_attention_mask()
+        )
+
+        self.vocab_size = vocab_size
+        self.token_embedding = nn.Embedding(vocab_size, transformer_width)
+        self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
+        self.ln_final = LayerNorm(transformer_width)
+
+        self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
+        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
+
+        self.initialize_parameters()
+
+    def initialize_parameters(self):
+        nn.init.normal_(self.token_embedding.weight, std=0.02)
+        nn.init.normal_(self.positional_embedding, std=0.01)
+
+        if isinstance(self.visual, ModifiedResNet):
+            if self.visual.attnpool is not None:
+                std = self.visual.attnpool.c_proj.in_features ** -0.5
+                nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
+                nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
+                nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
+                nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)
+
+            for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:
+                for name, param in resnet_block.named_parameters():
+                    if name.endswith("bn3.weight"):
+                        nn.init.zeros_(param)
+
+        proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)
+        attn_std = self.transformer.width ** -0.5
+        fc_std = (2 * self.transformer.width) ** -0.5
+        for block in self.transformer.resblocks:
+            nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
+            nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
+            nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
+            nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
+
+        if self.text_projection is not None:
+            nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)
+
+    def build_attention_mask(self):
+        # lazily create causal attention mask, with full attention between the vision tokens
+        # pytorch uses additive attention mask; fill with -inf
+        mask = torch.empty(self.context_length, self.context_length)
+        mask.fill_(float("-inf"))
+        mask.triu_(1)  # zero out the lower diagonal
+        return mask
+
+    @property
+    def dtype(self):
+        return self.visual.conv1.weight.dtype
+
+    def encode_image(self, image):
+        return self.visual(image.type(self.dtype))
+
+    def encode_text(self, text):
+        x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]
+
+        x = x + self.positional_embedding.type(self.dtype)
+        x = x.permute(1, 0, 2)  # NLD -> LND
+        x = self.transformer(x)
+        x = x.permute(1, 0, 2)  # LND -> NLD
+        x = self.ln_final(x).type(self.dtype)
+
+        # x.shape = [batch_size, n_ctx, transformer.width]
+        # take features from the eot embedding (eot_token is the highest number in each sequence)
+        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
+
+        return x
+
+    def forward(self, image, text):
+        image_features = self.encode_image(image)
+        text_features = self.encode_text(text)
+
+        # normalized features
+        image_features = image_features / image_features.norm(dim=1, keepdim=True)
+        text_features = text_features / text_features.norm(dim=1, keepdim=True)
+
+        # cosine similarity as logits
+        logit_scale = self.logit_scale.exp()
+        logits_per_image = logit_scale * image_features @ text_features.t()
+        logits_per_text = logits_per_image.t()
+
+        # shape = [global_batch_size, global_batch_size]
+        return logits_per_image, logits_per_text
+
+
+def convert_weights(model: nn.Module):
+    """Convert applicable model parameters to fp16"""
+
+    def _convert_weights_to_fp16(l):
+        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
+            l.weight.data = l.weight.data.half()
+            if l.bias is not None:
+                l.bias.data = l.bias.data.half()
+
+        if isinstance(l, nn.MultiheadAttention):
+            for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
+                tensor = getattr(l, attr)
+                if tensor is not None:
+                    tensor.data = tensor.data.half()
+
+        for name in ["text_projection", "proj"]:
+            if hasattr(l, name):
+                attr = getattr(l, name)
+                if attr is not None:
+                    attr.data = attr.data.half()
+
+    model.apply(_convert_weights_to_fp16)
+
+
+def build_model(state_dict: dict):
+    vit = "visual.proj" in state_dict
+
+    if vit:
+        vision_width = state_dict["visual.conv1.weight"].shape[0]
+        vision_layers = len([k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
+        vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
+        grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
+        image_resolution = vision_patch_size * grid_size
+    else:
+        counts: list = [len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
+        vision_layers = tuple(counts)
+        vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
+        output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
+        vision_patch_size = None
+        assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
+        image_resolution = output_width * 32
+
+    embed_dim = state_dict["text_projection"].shape[1]
+    context_length = state_dict["positional_embedding"].shape[0]
+    vocab_size = state_dict["token_embedding.weight"].shape[0]
+    transformer_width = state_dict["ln_final.weight"].shape[0]
+    transformer_heads = transformer_width // 64
+    transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith("transformer.resblocks")))
+
+    model = CLIP(
+        embed_dim,
+        image_resolution, vision_layers, vision_width, vision_patch_size,
+        context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
+    )
+
+    for key in ["input_resolution", "context_length", "vocab_size"]:
+        if key in state_dict:
+            del state_dict[key]
+
+    convert_weights(model)
+    model.load_state_dict(state_dict)
+    return model.eval()
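
One detail worth calling out from `model.py` above: the text transformer is made causal purely through the additive mask built in `CLIP.build_attention_mask()`. The minimal sketch below (not part of the diff) reproduces that construction at a small context length to show the resulting pattern of zeros and `-inf`; because the mask is added to the attention scores before the softmax, each token can only attend to itself and earlier tokens.

```python
import torch

context_length = 5  # the real model uses 77
mask = torch.empty(context_length, context_length)
mask.fill_(float("-inf"))
mask.triu_(1)  # keep -inf strictly above the diagonal; on/below the diagonal becomes 0
print(mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0., 0., -inf, -inf, -inf],
#         [0., 0., 0., -inf, -inf],
#         [0., 0., 0., 0., -inf],
#         [0., 0., 0., 0., 0.]])
```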
CLIP/clip/simple_tokenizer.py
ADDED
@@ -0,0 +1,132 @@
+import gzip
+import html
+import os
+from functools import lru_cache
+
+import ftfy
+import regex as re
+
+
+@lru_cache()
+def default_bpe():
+    return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")
+
+
+@lru_cache()
+def bytes_to_unicode():
+    """
+    Returns list of utf-8 byte and a corresponding list of unicode strings.
+    The reversible bpe codes work on unicode strings.
+    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
+    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
+    This is a signficant percentage of your normal, say, 32K bpe vocab.
+    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
+    And avoids mapping to whitespace/control characters the bpe code barfs on.
+    """
+    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
+    cs = bs[:]
+    n = 0
+    for b in range(2**8):
+        if b not in bs:
+            bs.append(b)
+            cs.append(2**8+n)
+            n += 1
+    cs = [chr(n) for n in cs]
+    return dict(zip(bs, cs))
+
+
+def get_pairs(word):
+    """Return set of symbol pairs in a word.
+    Word is represented as tuple of symbols (symbols being variable-length strings).
+    """
+    pairs = set()
+    prev_char = word[0]
+    for char in word[1:]:
+        pairs.add((prev_char, char))
+        prev_char = char
+    return pairs
+
+
+def basic_clean(text):
+    text = ftfy.fix_text(text)
+    text = html.unescape(html.unescape(text))
+    return text.strip()
+
+
+def whitespace_clean(text):
+    text = re.sub(r'\s+', ' ', text)
+    text = text.strip()
+    return text
+
+
+class SimpleTokenizer(object):
+    def __init__(self, bpe_path: str = default_bpe()):
+        self.byte_encoder = bytes_to_unicode()
+        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
+        merges = gzip.open(bpe_path).read().decode("utf-8").split('\n')
+        merges = merges[1:49152-256-2+1]
+        merges = [tuple(merge.split()) for merge in merges]
+        vocab = list(bytes_to_unicode().values())
+        vocab = vocab + [v+'</w>' for v in vocab]
+        for merge in merges:
+            vocab.append(''.join(merge))
+        vocab.extend(['<|startoftext|>', '<|endoftext|>'])
+        self.encoder = dict(zip(vocab, range(len(vocab))))
+        self.decoder = {v: k for k, v in self.encoder.items()}
+        self.bpe_ranks = dict(zip(merges, range(len(merges))))
+        self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
+        self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE)
+
+    def bpe(self, token):
+        if token in self.cache:
+            return self.cache[token]
+        word = tuple(token[:-1]) + ( token[-1] + '</w>',)
+        pairs = get_pairs(word)
+
+        if not pairs:
+            return token+'</w>'
+
+        while True:
+            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
+            if bigram not in self.bpe_ranks:
+                break
+            first, second = bigram
+            new_word = []
+            i = 0
+            while i < len(word):
+                try:
+                    j = word.index(first, i)
+                    new_word.extend(word[i:j])
+                    i = j
+                except:
+                    new_word.extend(word[i:])
+                    break
+
+                if word[i] == first and i < len(word)-1 and word[i+1] == second:
+                    new_word.append(first+second)
+                    i += 2
+                else:
+                    new_word.append(word[i])
+                    i += 1
+            new_word = tuple(new_word)
+            word = new_word
+            if len(word) == 1:
+                break
+            else:
+                pairs = get_pairs(word)
+        word = ' '.join(word)
+        self.cache[token] = word
+        return word
+
+    def encode(self, text):
+        bpe_tokens = []
+        text = whitespace_clean(basic_clean(text)).lower()
+        for token in re.findall(self.pat, text):
+            token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
+            bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
+        return bpe_tokens
+
+    def decode(self, tokens):
+        text = ''.join([self.decoder[token] for token in tokens])
+        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors="replace").replace('</w>', ' ')
+        return text
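
A quick round-trip sketch for the BPE tokenizer above (not part of the diff; it assumes the installed `clip` package so the bundled `bpe_simple_vocab_16e6.txt.gz` is found): `encode()` cleans and lower-cases the text before mapping it to token ids, and `decode()` maps ids back to whitespace-normalized text using the `</w>` end-of-word markers.

```python
from clip.simple_tokenizer import SimpleTokenizer

tokenizer = SimpleTokenizer()
ids = tokenizer.encode("Hello, world!")
print(ids)                    # a short list of BPE token ids
print(tokenizer.decode(ids))  # roughly "hello , world ! " -- lower-cased, with punctuation split out
```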
CLIP/data/country211.md
ADDED
@@ -0,0 +1,12 @@
+# The Country211 Dataset
+
+In the paper, we used an image classification dataset called Country211, to evaluate the model's capability on geolocation. To do so, we filtered the YFCC100m dataset to images that have a GPS coordinate corresponding to an [ISO-3166 country code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) and created a balanced dataset by sampling 150 train images, 50 validation images, and 100 test images for each country.
+
+The following command will download an 11GB archive containing the images and extract it into a subdirectory `country211`:
+
+```bash
+wget https://openaipublic.azureedge.net/clip/data/country211.tgz
+tar zxvf country211.tgz
+```
+
+These images are a subset of the YFCC100m dataset. Use of the underlying media files is subject to the Creative Commons licenses chosen by their creators/uploaders. For more information about the YFCC100M dataset, visit [the official website](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).
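
As a hedged illustration only (not part of the diff): if the archive extracts into per-split, per-country folders such as `country211/train/<ISO code>/*.jpg` — an assumption about the layout, not something stated in the note above — the splits can be read with torchvision's `ImageFolder`:

```python
from torchvision import transforms
from torchvision.datasets import ImageFolder

# Hypothetical layout: country211/{train,valid,test}/<ISO code>/*.jpg (assumption, see note above).
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train = ImageFolder("country211/train", transform=preprocess)
print(len(train.classes))  # expected: 211 country codes, one folder per class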
CLIP/data/prompts.md
ADDED
@@ -0,0 +1,3401 @@
# Prompts for Image Classification

Below are the class names and templates that are used for collecting the zero-shot classification scores in the paper. Each dataset has two lists `classes` and `templates`, where the string `{}` in the template is to be replaced with the corresponding class names. For the Facial Emotion Recognition 2013 dataset specifically, we used multiple class names for certain classes.

This file contains prompt data for 26 of the 27 datasets shown in Table 9 of the paper; the text prompts for ImageNet (as well as other [ImageNet Testbed](https://modestyachts.github.io/imagenet-testbed/) datasets in Figure 13) can be found in [this notebook](https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb), as well as how to ensemble predictions from multiple prompts using these templates.

If you are viewing this document on GitHub, use the table of contents icon at the upper left to browse the datasets.

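To make the format concrete, here is a minimal sketch (not part of the original file) of how these lists are consumed: each template's `{}` is filled with a class name via `str.format`, and the resulting prompt strings are what get tokenized for CLIP's text encoder. The class and template values are copied from the Birdsnap section below; treating `clip.tokenize` as the next step is an illustration of typical usage, not the exact evaluation script behind the paper numbers.

```python
# Illustrative sketch: expand (classes, templates) into prompt strings for CLIP.
import clip  # the CLIP package bundled under CLIP/ in this repo

classes = ['Acadian Flycatcher', 'Acorn Woodpecker']   # first two Birdsnap classes below
templates = ['a photo of a {}, a type of bird.']       # the single Birdsnap template below

prompts = [t.format(c) for c in classes for t in templates]
print(prompts)
# ['a photo of a Acadian Flycatcher, a type of bird.',
#  'a photo of a Acorn Woodpecker, a type of bird.']

tokens = clip.tokenize(prompts)  # LongTensor of shape (len(prompts), 77), ready for model.encode_text
```
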
## Birdsnap
|
11 |
+
|
12 |
+
```bash
|
13 |
+
classes = [
|
14 |
+
'Acadian Flycatcher',
|
15 |
+
'Acorn Woodpecker',
|
16 |
+
'Alder Flycatcher',
|
17 |
+
'Allens Hummingbird',
|
18 |
+
'Altamira Oriole',
|
19 |
+
'American Avocet',
|
20 |
+
'American Bittern',
|
21 |
+
'American Black Duck',
|
22 |
+
'American Coot',
|
23 |
+
'American Crow',
|
24 |
+
'American Dipper',
|
25 |
+
'American Golden Plover',
|
26 |
+
'American Goldfinch',
|
27 |
+
'American Kestrel',
|
28 |
+
'American Oystercatcher',
|
29 |
+
'American Pipit',
|
30 |
+
'American Redstart',
|
31 |
+
'American Robin',
|
32 |
+
'American Three toed Woodpecker',
|
33 |
+
'American Tree Sparrow',
|
34 |
+
'American White Pelican',
|
35 |
+
'American Wigeon',
|
36 |
+
'American Woodcock',
|
37 |
+
'Anhinga',
|
38 |
+
'Annas Hummingbird',
|
39 |
+
'Arctic Tern',
|
40 |
+
'Ash throated Flycatcher',
|
41 |
+
'Audubons Oriole',
|
42 |
+
'Bairds Sandpiper',
|
43 |
+
'Bald Eagle',
|
44 |
+
'Baltimore Oriole',
|
45 |
+
'Band tailed Pigeon',
|
46 |
+
'Barn Swallow',
|
47 |
+
'Barred Owl',
|
48 |
+
'Barrows Goldeneye',
|
49 |
+
'Bay breasted Warbler',
|
50 |
+
'Bells Vireo',
|
51 |
+
'Belted Kingfisher',
|
52 |
+
'Bewicks Wren',
|
53 |
+
'Black Guillemot',
|
54 |
+
'Black Oystercatcher',
|
55 |
+
'Black Phoebe',
|
56 |
+
'Black Rosy Finch',
|
57 |
+
'Black Scoter',
|
58 |
+
'Black Skimmer',
|
59 |
+
'Black Tern',
|
60 |
+
'Black Turnstone',
|
61 |
+
'Black Vulture',
|
62 |
+
'Black and white Warbler',
|
63 |
+
'Black backed Woodpecker',
|
64 |
+
'Black bellied Plover',
|
65 |
+
'Black billed Cuckoo',
|
66 |
+
'Black billed Magpie',
|
67 |
+
'Black capped Chickadee',
|
68 |
+
'Black chinned Hummingbird',
|
69 |
+
'Black chinned Sparrow',
|
70 |
+
'Black crested Titmouse',
|
71 |
+
'Black crowned Night Heron',
|
72 |
+
'Black headed Grosbeak',
|
73 |
+
'Black legged Kittiwake',
|
74 |
+
'Black necked Stilt',
|
75 |
+
'Black throated Blue Warbler',
|
76 |
+
'Black throated Gray Warbler',
|
77 |
+
'Black throated Green Warbler',
|
78 |
+
'Black throated Sparrow',
|
79 |
+
'Blackburnian Warbler',
|
80 |
+
'Blackpoll Warbler',
|
81 |
+
'Blue Grosbeak',
|
82 |
+
'Blue Jay',
|
83 |
+
'Blue gray Gnatcatcher',
|
84 |
+
'Blue headed Vireo',
|
85 |
+
'Blue winged Teal',
|
86 |
+
'Blue winged Warbler',
|
87 |
+
'Boat tailed Grackle',
|
88 |
+
'Bobolink',
|
89 |
+
'Bohemian Waxwing',
|
90 |
+
'Bonapartes Gull',
|
91 |
+
'Boreal Chickadee',
|
92 |
+
'Brandts Cormorant',
|
93 |
+
'Brant',
|
94 |
+
'Brewers Blackbird',
|
95 |
+
'Brewers Sparrow',
|
96 |
+
'Bridled Titmouse',
|
97 |
+
'Broad billed Hummingbird',
|
98 |
+
'Broad tailed Hummingbird',
|
99 |
+
'Broad winged Hawk',
|
100 |
+
'Bronzed Cowbird',
|
101 |
+
'Brown Creeper',
|
102 |
+
'Brown Pelican',
|
103 |
+
'Brown Thrasher',
|
104 |
+
'Brown capped Rosy Finch',
|
105 |
+
'Brown crested Flycatcher',
|
106 |
+
'Brown headed Cowbird',
|
107 |
+
'Brown headed Nuthatch',
|
108 |
+
'Bufflehead',
|
109 |
+
'Bullocks Oriole',
|
110 |
+
'Burrowing Owl',
|
111 |
+
'Bushtit',
|
112 |
+
'Cackling Goose',
|
113 |
+
'Cactus Wren',
|
114 |
+
'California Gull',
|
115 |
+
'California Quail',
|
116 |
+
'California Thrasher',
|
117 |
+
'California Towhee',
|
118 |
+
'Calliope Hummingbird',
|
119 |
+
'Canada Goose',
|
120 |
+
'Canada Warbler',
|
121 |
+
'Canvasback',
|
122 |
+
'Canyon Towhee',
|
123 |
+
'Canyon Wren',
|
124 |
+
'Cape May Warbler',
|
125 |
+
'Carolina Chickadee',
|
126 |
+
'Carolina Wren',
|
127 |
+
'Caspian Tern',
|
128 |
+
'Cassins Finch',
|
129 |
+
'Cassins Kingbird',
|
130 |
+
'Cassins Sparrow',
|
131 |
+
'Cassins Vireo',
|
132 |
+
'Cattle Egret',
|
133 |
+
'Cave Swallow',
|
134 |
+
'Cedar Waxwing',
|
135 |
+
'Cerulean Warbler',
|
136 |
+
'Chestnut backed Chickadee',
|
137 |
+
'Chestnut collared Longspur',
|
138 |
+
'Chestnut sided Warbler',
|
139 |
+
'Chihuahuan Raven',
|
140 |
+
'Chimney Swift',
|
141 |
+
'Chipping Sparrow',
|
142 |
+
'Cinnamon Teal',
|
143 |
+
'Clapper Rail',
|
144 |
+
'Clarks Grebe',
|
145 |
+
'Clarks Nutcracker',
|
146 |
+
'Clay colored Sparrow',
|
147 |
+
'Cliff Swallow',
|
148 |
+
'Common Black Hawk',
|
149 |
+
'Common Eider',
|
150 |
+
'Common Gallinule',
|
151 |
+
'Common Goldeneye',
|
152 |
+
'Common Grackle',
|
153 |
+
'Common Ground Dove',
|
154 |
+
'Common Loon',
|
155 |
+
'Common Merganser',
|
156 |
+
'Common Murre',
|
157 |
+
'Common Nighthawk',
|
158 |
+
'Common Raven',
|
159 |
+
'Common Redpoll',
|
160 |
+
'Common Tern',
|
161 |
+
'Common Yellowthroat',
|
162 |
+
'Connecticut Warbler',
|
163 |
+
'Coopers Hawk',
|
164 |
+
'Cordilleran Flycatcher',
|
165 |
+
'Costas Hummingbird',
|
166 |
+
'Couchs Kingbird',
|
167 |
+
'Crested Caracara',
|
168 |
+
'Curve billed Thrasher',
|
169 |
+
'Dark eyed Junco',
|
170 |
+
'Dickcissel',
|
171 |
+
'Double crested Cormorant',
|
172 |
+
'Downy Woodpecker',
|
173 |
+
'Dunlin',
|
174 |
+
'Dusky Flycatcher',
|
175 |
+
'Dusky Grouse',
|
176 |
+
'Eared Grebe',
|
177 |
+
'Eastern Bluebird',
|
178 |
+
'Eastern Kingbird',
|
179 |
+
'Eastern Meadowlark',
|
180 |
+
'Eastern Phoebe',
|
181 |
+
'Eastern Screech Owl',
|
182 |
+
'Eastern Towhee',
|
183 |
+
'Eastern Wood Pewee',
|
184 |
+
'Elegant Trogon',
|
185 |
+
'Elf Owl',
|
186 |
+
'Eurasian Collared Dove',
|
187 |
+
'Eurasian Wigeon',
|
188 |
+
'European Starling',
|
189 |
+
'Evening Grosbeak',
|
190 |
+
'Ferruginous Hawk',
|
191 |
+
'Ferruginous Pygmy Owl',
|
192 |
+
'Field Sparrow',
|
193 |
+
'Fish Crow',
|
194 |
+
'Florida Scrub Jay',
|
195 |
+
'Forsters Tern',
|
196 |
+
'Fox Sparrow',
|
197 |
+
'Franklins Gull',
|
198 |
+
'Fulvous Whistling Duck',
|
199 |
+
'Gadwall',
|
200 |
+
'Gambels Quail',
|
201 |
+
'Gila Woodpecker',
|
202 |
+
'Glaucous Gull',
|
203 |
+
'Glaucous winged Gull',
|
204 |
+
'Glossy Ibis',
|
205 |
+
'Golden Eagle',
|
206 |
+
'Golden crowned Kinglet',
|
207 |
+
'Golden crowned Sparrow',
|
208 |
+
'Golden fronted Woodpecker',
|
209 |
+
'Golden winged Warbler',
|
210 |
+
'Grasshopper Sparrow',
|
211 |
+
'Gray Catbird',
|
212 |
+
'Gray Flycatcher',
|
213 |
+
'Gray Jay',
|
214 |
+
'Gray Kingbird',
|
215 |
+
'Gray cheeked Thrush',
|
216 |
+
'Gray crowned Rosy Finch',
|
217 |
+
'Great Black backed Gull',
|
218 |
+
'Great Blue Heron',
|
219 |
+
'Great Cormorant',
|
220 |
+
'Great Crested Flycatcher',
|
221 |
+
'Great Egret',
|
222 |
+
'Great Gray Owl',
|
223 |
+
'Great Horned Owl',
|
224 |
+
'Great Kiskadee',
|
225 |
+
'Great tailed Grackle',
|
226 |
+
'Greater Prairie Chicken',
|
227 |
+
'Greater Roadrunner',
|
228 |
+
'Greater Sage Grouse',
|
229 |
+
'Greater Scaup',
|
230 |
+
'Greater White fronted Goose',
|
231 |
+
'Greater Yellowlegs',
|
232 |
+
'Green Jay',
|
233 |
+
'Green tailed Towhee',
|
234 |
+
'Green winged Teal',
|
235 |
+
'Groove billed Ani',
|
236 |
+
'Gull billed Tern',
|
237 |
+
'Hairy Woodpecker',
|
238 |
+
'Hammonds Flycatcher',
|
239 |
+
'Harlequin Duck',
|
240 |
+
'Harriss Hawk',
|
241 |
+
'Harriss Sparrow',
|
242 |
+
'Heermanns Gull',
|
243 |
+
'Henslows Sparrow',
|
244 |
+
'Hepatic Tanager',
|
245 |
+
'Hermit Thrush',
|
246 |
+
'Herring Gull',
|
247 |
+
'Hoary Redpoll',
|
248 |
+
'Hooded Merganser',
|
249 |
+
'Hooded Oriole',
|
250 |
+
'Hooded Warbler',
|
251 |
+
'Horned Grebe',
|
252 |
+
'Horned Lark',
|
253 |
+
'House Finch',
|
254 |
+
'House Sparrow',
|
255 |
+
'House Wren',
|
256 |
+
'Huttons Vireo',
|
257 |
+
'Iceland Gull',
|
258 |
+
'Inca Dove',
|
259 |
+
'Indigo Bunting',
|
260 |
+
'Killdeer',
|
261 |
+
'King Rail',
|
262 |
+
'Ladder backed Woodpecker',
|
263 |
+
'Lapland Longspur',
|
264 |
+
'Lark Bunting',
|
265 |
+
'Lark Sparrow',
|
266 |
+
'Laughing Gull',
|
267 |
+
'Lazuli Bunting',
|
268 |
+
'Le Contes Sparrow',
|
269 |
+
'Least Bittern',
|
270 |
+
'Least Flycatcher',
|
271 |
+
'Least Grebe',
|
272 |
+
'Least Sandpiper',
|
273 |
+
'Least Tern',
|
274 |
+
'Lesser Goldfinch',
|
275 |
+
'Lesser Nighthawk',
|
276 |
+
'Lesser Scaup',
|
277 |
+
'Lesser Yellowlegs',
|
278 |
+
'Lewiss Woodpecker',
|
279 |
+
'Limpkin',
|
280 |
+
'Lincolns Sparrow',
|
281 |
+
'Little Blue Heron',
|
282 |
+
'Loggerhead Shrike',
|
283 |
+
'Long billed Curlew',
|
284 |
+
'Long billed Dowitcher',
|
285 |
+
'Long billed Thrasher',
|
286 |
+
'Long eared Owl',
|
287 |
+
'Long tailed Duck',
|
288 |
+
'Louisiana Waterthrush',
|
289 |
+
'Magnificent Frigatebird',
|
290 |
+
'Magnolia Warbler',
|
291 |
+
'Mallard',
|
292 |
+
'Marbled Godwit',
|
293 |
+
'Marsh Wren',
|
294 |
+
'Merlin',
|
295 |
+
'Mew Gull',
|
296 |
+
'Mexican Jay',
|
297 |
+
'Mississippi Kite',
|
298 |
+
'Monk Parakeet',
|
299 |
+
'Mottled Duck',
|
300 |
+
'Mountain Bluebird',
|
301 |
+
'Mountain Chickadee',
|
302 |
+
'Mountain Plover',
|
303 |
+
'Mourning Dove',
|
304 |
+
'Mourning Warbler',
|
305 |
+
'Muscovy Duck',
|
306 |
+
'Mute Swan',
|
307 |
+
'Nashville Warbler',
|
308 |
+
'Nelsons Sparrow',
|
309 |
+
'Neotropic Cormorant',
|
310 |
+
'Northern Bobwhite',
|
311 |
+
'Northern Cardinal',
|
312 |
+
'Northern Flicker',
|
313 |
+
'Northern Gannet',
|
314 |
+
'Northern Goshawk',
|
315 |
+
'Northern Harrier',
|
316 |
+
'Northern Hawk Owl',
|
317 |
+
'Northern Mockingbird',
|
318 |
+
'Northern Parula',
|
319 |
+
'Northern Pintail',
|
320 |
+
'Northern Rough winged Swallow',
|
321 |
+
'Northern Saw whet Owl',
|
322 |
+
'Northern Shrike',
|
323 |
+
'Northern Waterthrush',
|
324 |
+
'Nuttalls Woodpecker',
|
325 |
+
'Oak Titmouse',
|
326 |
+
'Olive Sparrow',
|
327 |
+
'Olive sided Flycatcher',
|
328 |
+
'Orange crowned Warbler',
|
329 |
+
'Orchard Oriole',
|
330 |
+
'Osprey',
|
331 |
+
'Ovenbird',
|
332 |
+
'Pacific Golden Plover',
|
333 |
+
'Pacific Loon',
|
334 |
+
'Pacific Wren',
|
335 |
+
'Pacific slope Flycatcher',
|
336 |
+
'Painted Bunting',
|
337 |
+
'Painted Redstart',
|
338 |
+
'Palm Warbler',
|
339 |
+
'Pectoral Sandpiper',
|
340 |
+
'Peregrine Falcon',
|
341 |
+
'Phainopepla',
|
342 |
+
'Philadelphia Vireo',
|
343 |
+
'Pied billed Grebe',
|
344 |
+
'Pigeon Guillemot',
|
345 |
+
'Pileated Woodpecker',
|
346 |
+
'Pine Grosbeak',
|
347 |
+
'Pine Siskin',
|
348 |
+
'Pine Warbler',
|
349 |
+
'Piping Plover',
|
350 |
+
'Plumbeous Vireo',
|
351 |
+
'Prairie Falcon',
|
352 |
+
'Prairie Warbler',
|
353 |
+
'Prothonotary Warbler',
|
354 |
+
'Purple Finch',
|
355 |
+
'Purple Gallinule',
|
356 |
+
'Purple Martin',
|
357 |
+
'Purple Sandpiper',
|
358 |
+
'Pygmy Nuthatch',
|
359 |
+
'Pyrrhuloxia',
|
360 |
+
'Red Crossbill',
|
361 |
+
'Red Knot',
|
362 |
+
'Red Phalarope',
|
363 |
+
'Red bellied Woodpecker',
|
364 |
+
'Red breasted Merganser',
|
365 |
+
'Red breasted Nuthatch',
|
366 |
+
'Red breasted Sapsucker',
|
367 |
+
'Red cockaded Woodpecker',
|
368 |
+
'Red eyed Vireo',
|
369 |
+
'Red headed Woodpecker',
|
370 |
+
'Red naped Sapsucker',
|
371 |
+
'Red necked Grebe',
|
372 |
+
'Red necked Phalarope',
|
373 |
+
'Red shouldered Hawk',
|
374 |
+
'Red tailed Hawk',
|
375 |
+
'Red throated Loon',
|
376 |
+
'Red winged Blackbird',
|
377 |
+
'Reddish Egret',
|
378 |
+
'Redhead',
|
379 |
+
'Ring billed Gull',
|
380 |
+
'Ring necked Duck',
|
381 |
+
'Ring necked Pheasant',
|
382 |
+
'Rock Pigeon',
|
383 |
+
'Rock Ptarmigan',
|
384 |
+
'Rock Sandpiper',
|
385 |
+
'Rock Wren',
|
386 |
+
'Rose breasted Grosbeak',
|
387 |
+
'Roseate Tern',
|
388 |
+
'Rosss Goose',
|
389 |
+
'Rough legged Hawk',
|
390 |
+
'Royal Tern',
|
391 |
+
'Ruby crowned Kinglet',
|
392 |
+
'Ruby throated Hummingbird',
|
393 |
+
'Ruddy Duck',
|
394 |
+
'Ruddy Turnstone',
|
395 |
+
'Ruffed Grouse',
|
396 |
+
'Rufous Hummingbird',
|
397 |
+
'Rufous crowned Sparrow',
|
398 |
+
'Rusty Blackbird',
|
399 |
+
'Sage Thrasher',
|
400 |
+
'Saltmarsh Sparrow',
|
401 |
+
'Sanderling',
|
402 |
+
'Sandhill Crane',
|
403 |
+
'Sandwich Tern',
|
404 |
+
'Says Phoebe',
|
405 |
+
'Scaled Quail',
|
406 |
+
'Scarlet Tanager',
|
407 |
+
'Scissor tailed Flycatcher',
|
408 |
+
'Scotts Oriole',
|
409 |
+
'Seaside Sparrow',
|
410 |
+
'Sedge Wren',
|
411 |
+
'Semipalmated Plover',
|
412 |
+
'Semipalmated Sandpiper',
|
413 |
+
'Sharp shinned Hawk',
|
414 |
+
'Sharp tailed Grouse',
|
415 |
+
'Short billed Dowitcher',
|
416 |
+
'Short eared Owl',
|
417 |
+
'Snail Kite',
|
418 |
+
'Snow Bunting',
|
419 |
+
'Snow Goose',
|
420 |
+
'Snowy Egret',
|
421 |
+
'Snowy Owl',
|
422 |
+
'Snowy Plover',
|
423 |
+
'Solitary Sandpiper',
|
424 |
+
'Song Sparrow',
|
425 |
+
'Sooty Grouse',
|
426 |
+
'Sora',
|
427 |
+
'Spotted Owl',
|
428 |
+
'Spotted Sandpiper',
|
429 |
+
'Spotted Towhee',
|
430 |
+
'Spruce Grouse',
|
431 |
+
'Stellers Jay',
|
432 |
+
'Stilt Sandpiper',
|
433 |
+
'Summer Tanager',
|
434 |
+
'Surf Scoter',
|
435 |
+
'Surfbird',
|
436 |
+
'Swainsons Hawk',
|
437 |
+
'Swainsons Thrush',
|
438 |
+
'Swallow tailed Kite',
|
439 |
+
'Swamp Sparrow',
|
440 |
+
'Tennessee Warbler',
|
441 |
+
'Thayers Gull',
|
442 |
+
'Townsends Solitaire',
|
443 |
+
'Townsends Warbler',
|
444 |
+
'Tree Swallow',
|
445 |
+
'Tricolored Heron',
|
446 |
+
'Tropical Kingbird',
|
447 |
+
'Trumpeter Swan',
|
448 |
+
'Tufted Titmouse',
|
449 |
+
'Tundra Swan',
|
450 |
+
'Turkey Vulture',
|
451 |
+
'Upland Sandpiper',
|
452 |
+
'Varied Thrush',
|
453 |
+
'Veery',
|
454 |
+
'Verdin',
|
455 |
+
'Vermilion Flycatcher',
|
456 |
+
'Vesper Sparrow',
|
457 |
+
'Violet green Swallow',
|
458 |
+
'Virginia Rail',
|
459 |
+
'Wandering Tattler',
|
460 |
+
'Warbling Vireo',
|
461 |
+
'Western Bluebird',
|
462 |
+
'Western Grebe',
|
463 |
+
'Western Gull',
|
464 |
+
'Western Kingbird',
|
465 |
+
'Western Meadowlark',
|
466 |
+
'Western Sandpiper',
|
467 |
+
'Western Screech Owl',
|
468 |
+
'Western Scrub Jay',
|
469 |
+
'Western Tanager',
|
470 |
+
'Western Wood Pewee',
|
471 |
+
'Whimbrel',
|
472 |
+
'White Ibis',
|
473 |
+
'White breasted Nuthatch',
|
474 |
+
'White crowned Sparrow',
|
475 |
+
'White eyed Vireo',
|
476 |
+
'White faced Ibis',
|
477 |
+
'White headed Woodpecker',
|
478 |
+
'White rumped Sandpiper',
|
479 |
+
'White tailed Hawk',
|
480 |
+
'White tailed Kite',
|
481 |
+
'White tailed Ptarmigan',
|
482 |
+
'White throated Sparrow',
|
483 |
+
'White throated Swift',
|
484 |
+
'White winged Crossbill',
|
485 |
+
'White winged Dove',
|
486 |
+
'White winged Scoter',
|
487 |
+
'Wild Turkey',
|
488 |
+
'Willet',
|
489 |
+
'Williamsons Sapsucker',
|
490 |
+
'Willow Flycatcher',
|
491 |
+
'Willow Ptarmigan',
|
492 |
+
'Wilsons Phalarope',
|
493 |
+
'Wilsons Plover',
|
494 |
+
'Wilsons Snipe',
|
495 |
+
'Wilsons Warbler',
|
496 |
+
'Winter Wren',
|
497 |
+
'Wood Stork',
|
498 |
+
'Wood Thrush',
|
499 |
+
'Worm eating Warbler',
|
500 |
+
'Wrentit',
|
501 |
+
'Yellow Warbler',
|
502 |
+
'Yellow bellied Flycatcher',
|
503 |
+
'Yellow bellied Sapsucker',
|
504 |
+
'Yellow billed Cuckoo',
|
505 |
+
'Yellow billed Magpie',
|
506 |
+
'Yellow breasted Chat',
|
507 |
+
'Yellow crowned Night Heron',
|
508 |
+
'Yellow eyed Junco',
|
509 |
+
'Yellow headed Blackbird',
|
510 |
+
'Yellow rumped Warbler',
|
511 |
+
'Yellow throated Vireo',
|
512 |
+
'Yellow throated Warbler',
|
513 |
+
'Zone tailed Hawk',
|
514 |
+
]
|
515 |
+
|
516 |
+
templates = [
|
517 |
+
'a photo of a {}, a type of bird.',
|
518 |
+
]
|
519 |
+
```
|
520 |
+
|
521 |
+
|
522 |
+
|
523 |
+
## CIFAR10
|
524 |
+
|
525 |
+
```bash
|
526 |
+
classes = [
|
527 |
+
'airplane',
|
528 |
+
'automobile',
|
529 |
+
'bird',
|
530 |
+
'cat',
|
531 |
+
'deer',
|
532 |
+
'dog',
|
533 |
+
'frog',
|
534 |
+
'horse',
|
535 |
+
'ship',
|
536 |
+
'truck',
|
537 |
+
]
|
538 |
+
|
539 |
+
templates = [
|
540 |
+
'a photo of a {}.',
|
541 |
+
'a blurry photo of a {}.',
|
542 |
+
'a black and white photo of a {}.',
|
543 |
+
'a low contrast photo of a {}.',
|
544 |
+
'a high contrast photo of a {}.',
|
545 |
+
'a bad photo of a {}.',
|
546 |
+
'a good photo of a {}.',
|
547 |
+
'a photo of a small {}.',
|
548 |
+
'a photo of a big {}.',
|
549 |
+
'a photo of the {}.',
|
550 |
+
'a blurry photo of the {}.',
|
551 |
+
'a black and white photo of the {}.',
|
552 |
+
'a low contrast photo of the {}.',
|
553 |
+
'a high contrast photo of the {}.',
|
554 |
+
'a bad photo of the {}.',
|
555 |
+
'a good photo of the {}.',
|
556 |
+
'a photo of the small {}.',
|
557 |
+
'a photo of the big {}.',
|
558 |
+
]
|
559 |
+
```
|
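
CIFAR10 is one of the sections with many templates (eighteen of them), which is where prompt ensembling matters. The sketch below is one way to combine them, averaging the L2-normalized text embedding of every filled template for a class and renormalizing to get a single classifier weight per class; this follows the general recipe of the notebook linked in the introduction, but the model choice and the code itself are assumptions for illustration, not the script used for Table 9.

```python
# Sketch of prompt ensembling (assumed recipe: average normalized per-template embeddings).
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # model choice is an assumption

def build_zeroshot_weights(classes, templates):
    weights = []
    with torch.no_grad():
        for name in classes:
            tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
            emb = model.encode_text(tokens).float()
            emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each template embedding
            mean = emb.mean(dim=0)
            weights.append(mean / mean.norm())          # renormalize the ensembled embedding
    return torch.stack(weights, dim=1)                  # (embed_dim, num_classes)

# Usage: logits = 100.0 * image_features @ build_zeroshot_weights(classes, templates),
# where image_features are L2-normalized outputs of model.encode_image(...).
```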
560 |
+
|
561 |
+
|
562 |
+
|
563 |
+
## CIFAR100
|
564 |
+
|
565 |
+
```bash
|
566 |
+
classes = [
|
567 |
+
'apple',
|
568 |
+
'aquarium fish',
|
569 |
+
'baby',
|
570 |
+
'bear',
|
571 |
+
'beaver',
|
572 |
+
'bed',
|
573 |
+
'bee',
|
574 |
+
'beetle',
|
575 |
+
'bicycle',
|
576 |
+
'bottle',
|
577 |
+
'bowl',
|
578 |
+
'boy',
|
579 |
+
'bridge',
|
580 |
+
'bus',
|
581 |
+
'butterfly',
|
582 |
+
'camel',
|
583 |
+
'can',
|
584 |
+
'castle',
|
585 |
+
'caterpillar',
|
586 |
+
'cattle',
|
587 |
+
'chair',
|
588 |
+
'chimpanzee',
|
589 |
+
'clock',
|
590 |
+
'cloud',
|
591 |
+
'cockroach',
|
592 |
+
'couch',
|
593 |
+
'crab',
|
594 |
+
'crocodile',
|
595 |
+
'cup',
|
596 |
+
'dinosaur',
|
597 |
+
'dolphin',
|
598 |
+
'elephant',
|
599 |
+
'flatfish',
|
600 |
+
'forest',
|
601 |
+
'fox',
|
602 |
+
'girl',
|
603 |
+
'hamster',
|
604 |
+
'house',
|
605 |
+
'kangaroo',
|
606 |
+
'keyboard',
|
607 |
+
'lamp',
|
608 |
+
'lawn mower',
|
609 |
+
'leopard',
|
610 |
+
'lion',
|
611 |
+
'lizard',
|
612 |
+
'lobster',
|
613 |
+
'man',
|
614 |
+
'maple tree',
|
615 |
+
'motorcycle',
|
616 |
+
'mountain',
|
617 |
+
'mouse',
|
618 |
+
'mushroom',
|
619 |
+
'oak tree',
|
620 |
+
'orange',
|
621 |
+
'orchid',
|
622 |
+
'otter',
|
623 |
+
'palm tree',
|
624 |
+
'pear',
|
625 |
+
'pickup truck',
|
626 |
+
'pine tree',
|
627 |
+
'plain',
|
628 |
+
'plate',
|
629 |
+
'poppy',
|
630 |
+
'porcupine',
|
631 |
+
'possum',
|
632 |
+
'rabbit',
|
633 |
+
'raccoon',
|
634 |
+
'ray',
|
635 |
+
'road',
|
636 |
+
'rocket',
|
637 |
+
'rose',
|
638 |
+
'sea',
|
639 |
+
'seal',
|
640 |
+
'shark',
|
641 |
+
'shrew',
|
642 |
+
'skunk',
|
643 |
+
'skyscraper',
|
644 |
+
'snail',
|
645 |
+
'snake',
|
646 |
+
'spider',
|
647 |
+
'squirrel',
|
648 |
+
'streetcar',
|
649 |
+
'sunflower',
|
650 |
+
'sweet pepper',
|
651 |
+
'table',
|
652 |
+
'tank',
|
653 |
+
'telephone',
|
654 |
+
'television',
|
655 |
+
'tiger',
|
656 |
+
'tractor',
|
657 |
+
'train',
|
658 |
+
'trout',
|
659 |
+
'tulip',
|
660 |
+
'turtle',
|
661 |
+
'wardrobe',
|
662 |
+
'whale',
|
663 |
+
'willow tree',
|
664 |
+
'wolf',
|
665 |
+
'woman',
|
666 |
+
'worm',
|
667 |
+
]
|
668 |
+
|
669 |
+
templates = [
|
670 |
+
'a photo of a {}.',
|
671 |
+
'a blurry photo of a {}.',
|
672 |
+
'a black and white photo of a {}.',
|
673 |
+
'a low contrast photo of a {}.',
|
674 |
+
'a high contrast photo of a {}.',
|
675 |
+
'a bad photo of a {}.',
|
676 |
+
'a good photo of a {}.',
|
677 |
+
'a photo of a small {}.',
|
678 |
+
'a photo of a big {}.',
|
679 |
+
'a photo of the {}.',
|
680 |
+
'a blurry photo of the {}.',
|
681 |
+
'a black and white photo of the {}.',
|
682 |
+
'a low contrast photo of the {}.',
|
683 |
+
'a high contrast photo of the {}.',
|
684 |
+
'a bad photo of the {}.',
|
685 |
+
'a good photo of the {}.',
|
686 |
+
'a photo of the small {}.',
|
687 |
+
'a photo of the big {}.',
|
688 |
+
]
|
689 |
+
```
|
690 |
+
|
691 |
+
|
692 |
+
|
693 |
+
## CLEVRCounts
|
694 |
+
|
695 |
+
```bash
|
696 |
+
classes = [
|
697 |
+
'10',
|
698 |
+
'3',
|
699 |
+
'4',
|
700 |
+
'5',
|
701 |
+
'6',
|
702 |
+
'7',
|
703 |
+
'8',
|
704 |
+
'9',
|
705 |
+
]
|
706 |
+
|
707 |
+
templates = [
|
708 |
+
'a photo of {} objects.',
|
709 |
+
]
|
710 |
+
```
|
711 |
+
|
712 |
+
|
713 |
+
|
714 |
+
## Caltech101
|
715 |
+
|
716 |
+
```bash
|
717 |
+
classes = [
|
718 |
+
'background',
|
719 |
+
'off-center face',
|
720 |
+
'centered face',
|
721 |
+
'leopard',
|
722 |
+
'motorbike',
|
723 |
+
'accordion',
|
724 |
+
'airplane',
|
725 |
+
'anchor',
|
726 |
+
'ant',
|
727 |
+
'barrel',
|
728 |
+
'bass',
|
729 |
+
'beaver',
|
730 |
+
'binocular',
|
731 |
+
'bonsai',
|
732 |
+
'brain',
|
733 |
+
'brontosaurus',
|
734 |
+
'buddha',
|
735 |
+
'butterfly',
|
736 |
+
'camera',
|
737 |
+
'cannon',
|
738 |
+
'side of a car',
|
739 |
+
'ceiling fan',
|
740 |
+
'cellphone',
|
741 |
+
'chair',
|
742 |
+
'chandelier',
|
743 |
+
'body of a cougar cat',
|
744 |
+
'face of a cougar cat',
|
745 |
+
'crab',
|
746 |
+
'crayfish',
|
747 |
+
'crocodile',
|
748 |
+
'head of a crocodile',
|
749 |
+
'cup',
|
750 |
+
'dalmatian',
|
751 |
+
'dollar bill',
|
752 |
+
'dolphin',
|
753 |
+
'dragonfly',
|
754 |
+
'electric guitar',
|
755 |
+
'elephant',
|
756 |
+
'emu',
|
757 |
+
'euphonium',
|
758 |
+
'ewer',
|
759 |
+
'ferry',
|
760 |
+
'flamingo',
|
761 |
+
'head of a flamingo',
|
762 |
+
'garfield',
|
763 |
+
'gerenuk',
|
764 |
+
'gramophone',
|
765 |
+
'grand piano',
|
766 |
+
'hawksbill',
|
767 |
+
'headphone',
|
768 |
+
'hedgehog',
|
769 |
+
'helicopter',
|
770 |
+
'ibis',
|
771 |
+
'inline skate',
|
772 |
+
'joshua tree',
|
773 |
+
'kangaroo',
|
774 |
+
'ketch',
|
775 |
+
'lamp',
|
776 |
+
'laptop',
|
777 |
+
'llama',
|
778 |
+
'lobster',
|
779 |
+
'lotus',
|
780 |
+
'mandolin',
|
781 |
+
'mayfly',
|
782 |
+
'menorah',
|
783 |
+
'metronome',
|
784 |
+
'minaret',
|
785 |
+
'nautilus',
|
786 |
+
'octopus',
|
787 |
+
'okapi',
|
788 |
+
'pagoda',
|
789 |
+
'panda',
|
790 |
+
'pigeon',
|
791 |
+
'pizza',
|
792 |
+
'platypus',
|
793 |
+
'pyramid',
|
794 |
+
'revolver',
|
795 |
+
'rhino',
|
796 |
+
'rooster',
|
797 |
+
'saxophone',
|
798 |
+
'schooner',
|
799 |
+
'scissors',
|
800 |
+
'scorpion',
|
801 |
+
'sea horse',
|
802 |
+
'snoopy (cartoon beagle)',
|
803 |
+
'soccer ball',
|
804 |
+
'stapler',
|
805 |
+
'starfish',
|
806 |
+
'stegosaurus',
|
807 |
+
'stop sign',
|
808 |
+
'strawberry',
|
809 |
+
'sunflower',
|
810 |
+
'tick',
|
811 |
+
'trilobite',
|
812 |
+
'umbrella',
|
813 |
+
'watch',
|
814 |
+
'water lilly',
|
815 |
+
'wheelchair',
|
816 |
+
'wild cat',
|
817 |
+
'windsor chair',
|
818 |
+
'wrench',
|
819 |
+
'yin and yang symbol',
|
820 |
+
]
|
821 |
+
|
822 |
+
templates = [
|
823 |
+
'a photo of a {}.',
|
824 |
+
'a painting of a {}.',
|
825 |
+
'a plastic {}.',
|
826 |
+
'a sculpture of a {}.',
|
827 |
+
'a sketch of a {}.',
|
828 |
+
'a tattoo of a {}.',
|
829 |
+
'a toy {}.',
|
830 |
+
'a rendition of a {}.',
|
831 |
+
'a embroidered {}.',
|
832 |
+
'a cartoon {}.',
|
833 |
+
'a {} in a video game.',
|
834 |
+
'a plushie {}.',
|
835 |
+
'a origami {}.',
|
836 |
+
'art of a {}.',
|
837 |
+
'graffiti of a {}.',
|
838 |
+
'a drawing of a {}.',
|
839 |
+
'a doodle of a {}.',
|
840 |
+
'a photo of the {}.',
|
841 |
+
'a painting of the {}.',
|
842 |
+
'the plastic {}.',
|
843 |
+
'a sculpture of the {}.',
|
844 |
+
'a sketch of the {}.',
|
845 |
+
'a tattoo of the {}.',
|
846 |
+
'the toy {}.',
|
847 |
+
'a rendition of the {}.',
|
848 |
+
'the embroidered {}.',
|
849 |
+
'the cartoon {}.',
|
850 |
+
'the {} in a video game.',
|
851 |
+
'the plushie {}.',
|
852 |
+
'the origami {}.',
|
853 |
+
'art of the {}.',
|
854 |
+
'graffiti of the {}.',
|
855 |
+
'a drawing of the {}.',
|
856 |
+
'a doodle of the {}.',
|
857 |
+
]
|
858 |
+
```
|
859 |
+
|
860 |
+
|
861 |
+
|
862 |
+
## Country211
|
863 |
+
|
864 |
+
```bash
|
865 |
+
classes = [
|
866 |
+
'Andorra',
|
867 |
+
'United Arab Emirates',
|
868 |
+
'Afghanistan',
|
869 |
+
'Antigua and Barbuda',
|
870 |
+
'Anguilla',
|
871 |
+
'Albania',
|
872 |
+
'Armenia',
|
873 |
+
'Angola',
|
874 |
+
'Antarctica',
|
875 |
+
'Argentina',
|
876 |
+
'Austria',
|
877 |
+
'Australia',
|
878 |
+
'Aruba',
|
879 |
+
'Aland Islands',
|
880 |
+
'Azerbaijan',
|
881 |
+
'Bosnia and Herzegovina',
|
882 |
+
'Barbados',
|
883 |
+
'Bangladesh',
|
884 |
+
'Belgium',
|
885 |
+
'Burkina Faso',
|
886 |
+
'Bulgaria',
|
887 |
+
'Bahrain',
|
888 |
+
'Benin',
|
889 |
+
'Bermuda',
|
890 |
+
'Brunei Darussalam',
|
891 |
+
'Bolivia',
|
892 |
+
'Bonaire, Saint Eustatius and Saba',
|
893 |
+
'Brazil',
|
894 |
+
'Bahamas',
|
895 |
+
'Bhutan',
|
896 |
+
'Botswana',
|
897 |
+
'Belarus',
|
898 |
+
'Belize',
|
899 |
+
'Canada',
|
900 |
+
'DR Congo',
|
901 |
+
'Central African Republic',
|
902 |
+
'Switzerland',
|
903 |
+
"Cote d'Ivoire",
|
904 |
+
'Cook Islands',
|
905 |
+
'Chile',
|
906 |
+
'Cameroon',
|
907 |
+
'China',
|
908 |
+
'Colombia',
|
909 |
+
'Costa Rica',
|
910 |
+
'Cuba',
|
911 |
+
'Cabo Verde',
|
912 |
+
'Curacao',
|
913 |
+
'Cyprus',
|
914 |
+
'Czech Republic',
|
915 |
+
'Germany',
|
916 |
+
'Denmark',
|
917 |
+
'Dominica',
|
918 |
+
'Dominican Republic',
|
919 |
+
'Algeria',
|
920 |
+
'Ecuador',
|
921 |
+
'Estonia',
|
922 |
+
'Egypt',
|
923 |
+
'Spain',
|
924 |
+
'Ethiopia',
|
925 |
+
'Finland',
|
926 |
+
'Fiji',
|
927 |
+
'Falkland Islands',
|
928 |
+
'Faeroe Islands',
|
929 |
+
'France',
|
930 |
+
'Gabon',
|
931 |
+
'United Kingdom',
|
932 |
+
'Grenada',
|
933 |
+
'Georgia',
|
934 |
+
'French Guiana',
|
935 |
+
'Guernsey',
|
936 |
+
'Ghana',
|
937 |
+
'Gibraltar',
|
938 |
+
'Greenland',
|
939 |
+
'Gambia',
|
940 |
+
'Guadeloupe',
|
941 |
+
'Greece',
|
942 |
+
'South Georgia and South Sandwich Is.',
|
943 |
+
'Guatemala',
|
944 |
+
'Guam',
|
945 |
+
'Guyana',
|
946 |
+
'Hong Kong',
|
947 |
+
'Honduras',
|
948 |
+
'Croatia',
|
949 |
+
'Haiti',
|
950 |
+
'Hungary',
|
951 |
+
'Indonesia',
|
952 |
+
'Ireland',
|
953 |
+
'Israel',
|
954 |
+
'Isle of Man',
|
955 |
+
'India',
|
956 |
+
'Iraq',
|
957 |
+
'Iran',
|
958 |
+
'Iceland',
|
959 |
+
'Italy',
|
960 |
+
'Jersey',
|
961 |
+
'Jamaica',
|
962 |
+
'Jordan',
|
963 |
+
'Japan',
|
964 |
+
'Kenya',
|
965 |
+
'Kyrgyz Republic',
|
966 |
+
'Cambodia',
|
967 |
+
'St. Kitts and Nevis',
|
968 |
+
'North Korea',
|
969 |
+
'South Korea',
|
970 |
+
'Kuwait',
|
971 |
+
'Cayman Islands',
|
972 |
+
'Kazakhstan',
|
973 |
+
'Laos',
|
974 |
+
'Lebanon',
|
975 |
+
'St. Lucia',
|
976 |
+
'Liechtenstein',
|
977 |
+
'Sri Lanka',
|
978 |
+
'Liberia',
|
979 |
+
'Lithuania',
|
980 |
+
'Luxembourg',
|
981 |
+
'Latvia',
|
982 |
+
'Libya',
|
983 |
+
'Morocco',
|
984 |
+
'Monaco',
|
985 |
+
'Moldova',
|
986 |
+
'Montenegro',
|
987 |
+
'Saint-Martin',
|
988 |
+
'Madagascar',
|
989 |
+
'Macedonia',
|
990 |
+
'Mali',
|
991 |
+
'Myanmar',
|
992 |
+
'Mongolia',
|
993 |
+
'Macau',
|
994 |
+
'Martinique',
|
995 |
+
'Mauritania',
|
996 |
+
'Malta',
|
997 |
+
'Mauritius',
|
998 |
+
'Maldives',
|
999 |
+
'Malawi',
|
1000 |
+
'Mexico',
|
1001 |
+
'Malaysia',
|
1002 |
+
'Mozambique',
|
1003 |
+
'Namibia',
|
1004 |
+
'New Caledonia',
|
1005 |
+
'Nigeria',
|
1006 |
+
'Nicaragua',
|
1007 |
+
'Netherlands',
|
1008 |
+
'Norway',
|
1009 |
+
'Nepal',
|
1010 |
+
'New Zealand',
|
1011 |
+
'Oman',
|
1012 |
+
'Panama',
|
1013 |
+
'Peru',
|
1014 |
+
'French Polynesia',
|
1015 |
+
'Papua New Guinea',
|
1016 |
+
'Philippines',
|
1017 |
+
'Pakistan',
|
1018 |
+
'Poland',
|
1019 |
+
'Puerto Rico',
|
1020 |
+
'Palestine',
|
1021 |
+
'Portugal',
|
1022 |
+
'Palau',
|
1023 |
+
'Paraguay',
|
1024 |
+
'Qatar',
|
1025 |
+
'Reunion',
|
1026 |
+
'Romania',
|
1027 |
+
'Serbia',
|
1028 |
+
'Russia',
|
1029 |
+
'Rwanda',
|
1030 |
+
'Saudi Arabia',
|
1031 |
+
'Solomon Islands',
|
1032 |
+
'Seychelles',
|
1033 |
+
'Sudan',
|
1034 |
+
'Sweden',
|
1035 |
+
'Singapore',
|
1036 |
+
'St. Helena',
|
1037 |
+
'Slovenia',
|
1038 |
+
'Svalbard and Jan Mayen Islands',
|
1039 |
+
'Slovakia',
|
1040 |
+
'Sierra Leone',
|
1041 |
+
'San Marino',
|
1042 |
+
'Senegal',
|
1043 |
+
'Somalia',
|
1044 |
+
'South Sudan',
|
1045 |
+
'El Salvador',
|
1046 |
+
'Sint Maarten',
|
1047 |
+
'Syria',
|
1048 |
+
'Eswatini',
|
1049 |
+
'Togo',
|
1050 |
+
'Thailand',
|
1051 |
+
'Tajikistan',
|
1052 |
+
'Timor-Leste',
|
1053 |
+
'Turkmenistan',
|
1054 |
+
'Tunisia',
|
1055 |
+
'Tonga',
|
1056 |
+
'Turkey',
|
1057 |
+
'Trinidad and Tobago',
|
1058 |
+
'Taiwan',
|
1059 |
+
'Tanzania',
|
1060 |
+
'Ukraine',
|
1061 |
+
'Uganda',
|
1062 |
+
'United States',
|
1063 |
+
'Uruguay',
|
1064 |
+
'Uzbekistan',
|
1065 |
+
'Vatican',
|
1066 |
+
'Venezuela',
|
1067 |
+
'British Virgin Islands',
|
1068 |
+
'United States Virgin Islands',
|
1069 |
+
'Vietnam',
|
1070 |
+
'Vanuatu',
|
1071 |
+
'Samoa',
|
1072 |
+
'Kosovo',
|
1073 |
+
'Yemen',
|
1074 |
+
'South Africa',
|
1075 |
+
'Zambia',
|
1076 |
+
'Zimbabwe',
|
1077 |
+
]
|
1078 |
+
|
1079 |
+
templates = [
|
1080 |
+
'a photo i took in {}.',
|
1081 |
+
'a photo i took while visiting {}.',
|
1082 |
+
'a photo from my home country of {}.',
|
1083 |
+
'a photo from my visit to {}.',
|
1084 |
+
'a photo showing the country of {}.',
|
1085 |
+
]
|
1086 |
+
```
|
1087 |
+
|
1088 |
+
|
1089 |
+
|
1090 |
+
## DescribableTextures
|
1091 |
+
|
1092 |
+
```bash
|
1093 |
+
classes = [
|
1094 |
+
'banded',
|
1095 |
+
'blotchy',
|
1096 |
+
'braided',
|
1097 |
+
'bubbly',
|
1098 |
+
'bumpy',
|
1099 |
+
'chequered',
|
1100 |
+
'cobwebbed',
|
1101 |
+
'cracked',
|
1102 |
+
'crosshatched',
|
1103 |
+
'crystalline',
|
1104 |
+
'dotted',
|
1105 |
+
'fibrous',
|
1106 |
+
'flecked',
|
1107 |
+
'freckled',
|
1108 |
+
'frilly',
|
1109 |
+
'gauzy',
|
1110 |
+
'grid',
|
1111 |
+
'grooved',
|
1112 |
+
'honeycombed',
|
1113 |
+
'interlaced',
|
1114 |
+
'knitted',
|
1115 |
+
'lacelike',
|
1116 |
+
'lined',
|
1117 |
+
'marbled',
|
1118 |
+
'matted',
|
1119 |
+
'meshed',
|
1120 |
+
'paisley',
|
1121 |
+
'perforated',
|
1122 |
+
'pitted',
|
1123 |
+
'pleated',
|
1124 |
+
'polka-dotted',
|
1125 |
+
'porous',
|
1126 |
+
'potholed',
|
1127 |
+
'scaly',
|
1128 |
+
'smeared',
|
1129 |
+
'spiralled',
|
1130 |
+
'sprinkled',
|
1131 |
+
'stained',
|
1132 |
+
'stratified',
|
1133 |
+
'striped',
|
1134 |
+
'studded',
|
1135 |
+
'swirly',
|
1136 |
+
'veined',
|
1137 |
+
'waffled',
|
1138 |
+
'woven',
|
1139 |
+
'wrinkled',
|
1140 |
+
'zigzagged',
|
1141 |
+
]
|
1142 |
+
|
1143 |
+
templates = [
|
1144 |
+
'a photo of a {} texture.',
|
1145 |
+
'a photo of a {} pattern.',
|
1146 |
+
'a photo of a {} thing.',
|
1147 |
+
'a photo of a {} object.',
|
1148 |
+
'a photo of the {} texture.',
|
1149 |
+
'a photo of the {} pattern.',
|
1150 |
+
'a photo of the {} thing.',
|
1151 |
+
'a photo of the {} object.',
|
1152 |
+
]
|
1153 |
+
```
|
1154 |
+
|
1155 |
+
|
1156 |
+
|
1157 |
+
## EuroSAT
|
1158 |
+
|
1159 |
+
```bash
|
1160 |
+
classes = [
|
1161 |
+
'forest',
|
1162 |
+
'permanent crop land',
|
1163 |
+
'residential buildings or homes or apartments',
|
1164 |
+
'river',
|
1165 |
+
'pasture land',
|
1166 |
+
'lake or sea',
|
1167 |
+
'brushland or shrubland',
|
1168 |
+
'annual crop land',
|
1169 |
+
'industrial buildings or commercial buildings',
|
1170 |
+
'highway or road',
|
1171 |
+
]
|
1172 |
+
|
1173 |
+
templates = [
|
1174 |
+
'a centered satellite photo of {}.',
|
1175 |
+
'a centered satellite photo of a {}.',
|
1176 |
+
'a centered satellite photo of the {}.',
|
1177 |
+
]
|
1178 |
+
```
|
1179 |
+
|
1180 |
+
|
1181 |
+
|
1182 |
+
## FGVCAircraft
|
1183 |
+
|
1184 |
+
```bash
|
1185 |
+
classes = [
|
1186 |
+
'707-320',
|
1187 |
+
'727-200',
|
1188 |
+
'737-200',
|
1189 |
+
'737-300',
|
1190 |
+
'737-400',
|
1191 |
+
'737-500',
|
1192 |
+
'737-600',
|
1193 |
+
'737-700',
|
1194 |
+
'737-800',
|
1195 |
+
'737-900',
|
1196 |
+
'747-100',
|
1197 |
+
'747-200',
|
1198 |
+
'747-300',
|
1199 |
+
'747-400',
|
1200 |
+
'757-200',
|
1201 |
+
'757-300',
|
1202 |
+
'767-200',
|
1203 |
+
'767-300',
|
1204 |
+
'767-400',
|
1205 |
+
'777-200',
|
1206 |
+
'777-300',
|
1207 |
+
'A300B4',
|
1208 |
+
'A310',
|
1209 |
+
'A318',
|
1210 |
+
'A319',
|
1211 |
+
'A320',
|
1212 |
+
'A321',
|
1213 |
+
'A330-200',
|
1214 |
+
'A330-300',
|
1215 |
+
'A340-200',
|
1216 |
+
'A340-300',
|
1217 |
+
'A340-500',
|
1218 |
+
'A340-600',
|
1219 |
+
'A380',
|
1220 |
+
'ATR-42',
|
1221 |
+
'ATR-72',
|
1222 |
+
'An-12',
|
1223 |
+
'BAE 146-200',
|
1224 |
+
'BAE 146-300',
|
1225 |
+
'BAE-125',
|
1226 |
+
'Beechcraft 1900',
|
1227 |
+
'Boeing 717',
|
1228 |
+
'C-130',
|
1229 |
+
'C-47',
|
1230 |
+
'CRJ-200',
|
1231 |
+
'CRJ-700',
|
1232 |
+
'CRJ-900',
|
1233 |
+
'Cessna 172',
|
1234 |
+
'Cessna 208',
|
1235 |
+
'Cessna 525',
|
1236 |
+
'Cessna 560',
|
1237 |
+
'Challenger 600',
|
1238 |
+
'DC-10',
|
1239 |
+
'DC-3',
|
1240 |
+
'DC-6',
|
1241 |
+
'DC-8',
|
1242 |
+
'DC-9-30',
|
1243 |
+
'DH-82',
|
1244 |
+
'DHC-1',
|
1245 |
+
'DHC-6',
|
1246 |
+
'DHC-8-100',
|
1247 |
+
'DHC-8-300',
|
1248 |
+
'DR-400',
|
1249 |
+
'Dornier 328',
|
1250 |
+
'E-170',
|
1251 |
+
'E-190',
|
1252 |
+
'E-195',
|
1253 |
+
'EMB-120',
|
1254 |
+
'ERJ 135',
|
1255 |
+
'ERJ 145',
|
1256 |
+
'Embraer Legacy 600',
|
1257 |
+
'Eurofighter Typhoon',
|
1258 |
+
'F-16A/B',
|
1259 |
+
'F/A-18',
|
1260 |
+
'Falcon 2000',
|
1261 |
+
'Falcon 900',
|
1262 |
+
'Fokker 100',
|
1263 |
+
'Fokker 50',
|
1264 |
+
'Fokker 70',
|
1265 |
+
'Global Express',
|
1266 |
+
'Gulfstream IV',
|
1267 |
+
'Gulfstream V',
|
1268 |
+
'Hawk T1',
|
1269 |
+
'Il-76',
|
1270 |
+
'L-1011',
|
1271 |
+
'MD-11',
|
1272 |
+
'MD-80',
|
1273 |
+
'MD-87',
|
1274 |
+
'MD-90',
|
1275 |
+
'Metroliner',
|
1276 |
+
'Model B200',
|
1277 |
+
'PA-28',
|
1278 |
+
'SR-20',
|
1279 |
+
'Saab 2000',
|
1280 |
+
'Saab 340',
|
1281 |
+
'Spitfire',
|
1282 |
+
'Tornado',
|
1283 |
+
'Tu-134',
|
1284 |
+
'Tu-154',
|
1285 |
+
'Yak-42',
|
1286 |
+
]
|
1287 |
+
|
1288 |
+
templates = [
|
1289 |
+
'a photo of a {}, a type of aircraft.',
|
1290 |
+
'a photo of the {}, a type of aircraft.',
|
1291 |
+
]
|
1292 |
+
```
|
1293 |
+
|
1294 |
+
|
1295 |
+
|
1296 |
+
## FacialEmotionRecognition2013
|
1297 |
+
|
1298 |
+
```bash
|
1299 |
+
classes = [
|
1300 |
+
['angry'],
|
1301 |
+
['disgusted'],
|
1302 |
+
['fearful'],
|
1303 |
+
['happy', 'smiling'],
|
1304 |
+
['sad', 'depressed'],
|
1305 |
+
['surprised', 'shocked', 'spooked'],
|
1306 |
+
['neutral', 'bored'],
|
1307 |
+
]
|
1308 |
+
|
1309 |
+
templates = [
|
1310 |
+
'a photo of a {} looking face.',
|
1311 |
+
'a photo of a face showing the emotion: {}.',
|
1312 |
+
'a photo of a face looking {}.',
|
1313 |
+
'a face that looks {}.',
|
1314 |
+
'they look {}.',
|
1315 |
+
'look at how {} they are.',
|
1316 |
+
]
|
1317 |
+
```
|
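
As noted in the introduction, this is the one dataset where some classes are given as a list of synonyms rather than a single string. The file does not spell out how the synonyms are aggregated; a plausible reading (an assumption, not a statement of the paper's procedure) is to expand every synonym with every template and then average the resulting text embeddings exactly as in the CIFAR10 ensembling sketch above:

```python
# Sketch for synonym-list classes (assumption: pool all (synonym, template) prompts per class,
# then average their normalized embeddings as in the CIFAR10 sketch above).
fer_classes = [['angry'], ['happy', 'smiling'], ['surprised', 'shocked', 'spooked']]
templates = ['a photo of a {} looking face.', 'a face that looks {}.']

for synonyms in fer_classes:
    prompts = [t.format(s) for s in synonyms for t in templates]
    print(len(prompts), prompts[0])
# 2 a photo of a angry looking face.
# 4 a photo of a happy looking face.
# 6 a photo of a surprised looking face.
```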
1318 |
+
|
1319 |
+
|
1320 |
+
|
1321 |
+
## Flowers102
|
1322 |
+
|
1323 |
+
```bash
|
1324 |
+
classes = [
|
1325 |
+
'pink primrose',
|
1326 |
+
'hard-leaved pocket orchid',
|
1327 |
+
'canterbury bells',
|
1328 |
+
'sweet pea',
|
1329 |
+
'english marigold',
|
1330 |
+
'tiger lily',
|
1331 |
+
'moon orchid',
|
1332 |
+
'bird of paradise',
|
1333 |
+
'monkshood',
|
1334 |
+
'globe thistle',
|
1335 |
+
'snapdragon',
|
1336 |
+
"colt's foot",
|
1337 |
+
'king protea',
|
1338 |
+
'spear thistle',
|
1339 |
+
'yellow iris',
|
1340 |
+
'globe flower',
|
1341 |
+
'purple coneflower',
|
1342 |
+
'peruvian lily',
|
1343 |
+
'balloon flower',
|
1344 |
+
'giant white arum lily',
|
1345 |
+
'fire lily',
|
1346 |
+
'pincushion flower',
|
1347 |
+
'fritillary',
|
1348 |
+
'red ginger',
|
1349 |
+
'grape hyacinth',
|
1350 |
+
'corn poppy',
|
1351 |
+
'prince of wales feathers',
|
1352 |
+
'stemless gentian',
|
1353 |
+
'artichoke',
|
1354 |
+
'sweet william',
|
1355 |
+
'carnation',
|
1356 |
+
'garden phlox',
|
1357 |
+
'love in the mist',
|
1358 |
+
'mexican aster',
|
1359 |
+
'alpine sea holly',
|
1360 |
+
'ruby-lipped cattleya',
|
1361 |
+
'cape flower',
|
1362 |
+
'great masterwort',
|
1363 |
+
'siam tulip',
|
1364 |
+
'lenten rose',
|
1365 |
+
'barbeton daisy',
|
1366 |
+
'daffodil',
|
1367 |
+
'sword lily',
|
1368 |
+
'poinsettia',
|
1369 |
+
'bolero deep blue',
|
1370 |
+
'wallflower',
|
1371 |
+
'marigold',
|
1372 |
+
'buttercup',
|
1373 |
+
'oxeye daisy',
|
1374 |
+
'common dandelion',
|
1375 |
+
'petunia',
|
1376 |
+
'wild pansy',
|
1377 |
+
'primula',
|
1378 |
+
'sunflower',
|
1379 |
+
'pelargonium',
|
1380 |
+
'bishop of llandaff',
|
1381 |
+
'gaura',
|
1382 |
+
'geranium',
|
1383 |
+
'orange dahlia',
|
1384 |
+
'pink and yellow dahlia',
|
1385 |
+
'cautleya spicata',
|
1386 |
+
'japanese anemone',
|
1387 |
+
'black-eyed susan',
|
1388 |
+
'silverbush',
|
1389 |
+
'californian poppy',
|
1390 |
+
'osteospermum',
|
1391 |
+
'spring crocus',
|
1392 |
+
'bearded iris',
|
1393 |
+
'windflower',
|
1394 |
+
'tree poppy',
|
1395 |
+
'gazania',
|
1396 |
+
'azalea',
|
1397 |
+
'water lily',
|
1398 |
+
'rose',
|
1399 |
+
'thorn apple',
|
1400 |
+
'morning glory',
|
1401 |
+
'passion flower',
|
1402 |
+
'lotus',
|
1403 |
+
'toad lily',
|
1404 |
+
'anthurium',
|
1405 |
+
'frangipani',
|
1406 |
+
'clematis',
|
1407 |
+
'hibiscus',
|
1408 |
+
'columbine',
|
1409 |
+
'desert-rose',
|
1410 |
+
'tree mallow',
|
1411 |
+
'magnolia',
|
1412 |
+
'cyclamen',
|
1413 |
+
'watercress',
|
1414 |
+
'canna lily',
|
1415 |
+
'hippeastrum',
|
1416 |
+
'bee balm',
|
1417 |
+
'air plant',
|
1418 |
+
'foxglove',
|
1419 |
+
'bougainvillea',
|
1420 |
+
'camellia',
|
1421 |
+
'mallow',
|
1422 |
+
'mexican petunia',
|
1423 |
+
'bromelia',
|
1424 |
+
'blanket flower',
|
1425 |
+
'trumpet creeper',
|
1426 |
+
'blackberry lily',
|
1427 |
+
]
|
1428 |
+
|
1429 |
+
templates = [
|
1430 |
+
'a photo of a {}, a type of flower.',
|
1431 |
+
]
|
1432 |
+
```
|
1433 |
+
|
1434 |
+
|
1435 |
+
|
1436 |
+
## Food101
|
1437 |
+
|
1438 |
+
```bash
|
1439 |
+
classes = [
|
1440 |
+
'apple pie',
|
1441 |
+
'baby back ribs',
|
1442 |
+
'baklava',
|
1443 |
+
'beef carpaccio',
|
1444 |
+
'beef tartare',
|
1445 |
+
'beet salad',
|
1446 |
+
'beignets',
|
1447 |
+
'bibimbap',
|
1448 |
+
'bread pudding',
|
1449 |
+
'breakfast burrito',
|
1450 |
+
'bruschetta',
|
1451 |
+
'caesar salad',
|
1452 |
+
'cannoli',
|
1453 |
+
'caprese salad',
|
1454 |
+
'carrot cake',
|
1455 |
+
'ceviche',
|
1456 |
+
'cheese plate',
|
1457 |
+
'cheesecake',
|
1458 |
+
'chicken curry',
|
1459 |
+
'chicken quesadilla',
|
1460 |
+
'chicken wings',
|
1461 |
+
'chocolate cake',
|
1462 |
+
'chocolate mousse',
|
1463 |
+
'churros',
|
1464 |
+
'clam chowder',
|
1465 |
+
'club sandwich',
|
1466 |
+
'crab cakes',
|
1467 |
+
'creme brulee',
|
1468 |
+
'croque madame',
|
1469 |
+
'cup cakes',
|
1470 |
+
'deviled eggs',
|
1471 |
+
'donuts',
|
1472 |
+
'dumplings',
|
1473 |
+
'edamame',
|
1474 |
+
'eggs benedict',
|
1475 |
+
'escargots',
|
1476 |
+
'falafel',
|
1477 |
+
'filet mignon',
|
1478 |
+
'fish and chips',
|
1479 |
+
'foie gras',
|
1480 |
+
'french fries',
|
1481 |
+
'french onion soup',
|
1482 |
+
'french toast',
|
1483 |
+
'fried calamari',
|
1484 |
+
'fried rice',
|
1485 |
+
'frozen yogurt',
|
1486 |
+
'garlic bread',
|
1487 |
+
'gnocchi',
|
1488 |
+
'greek salad',
|
1489 |
+
'grilled cheese sandwich',
|
1490 |
+
'grilled salmon',
|
1491 |
+
'guacamole',
|
1492 |
+
'gyoza',
|
1493 |
+
'hamburger',
|
1494 |
+
'hot and sour soup',
|
1495 |
+
'hot dog',
|
1496 |
+
'huevos rancheros',
|
1497 |
+
'hummus',
|
1498 |
+
'ice cream',
|
1499 |
+
'lasagna',
|
1500 |
+
'lobster bisque',
|
1501 |
+
'lobster roll sandwich',
|
1502 |
+
'macaroni and cheese',
|
1503 |
+
'macarons',
|
1504 |
+
'miso soup',
|
1505 |
+
'mussels',
|
1506 |
+
'nachos',
|
1507 |
+
'omelette',
|
1508 |
+
'onion rings',
|
1509 |
+
'oysters',
|
1510 |
+
'pad thai',
|
1511 |
+
'paella',
|
1512 |
+
'pancakes',
|
1513 |
+
'panna cotta',
|
1514 |
+
'peking duck',
|
1515 |
+
'pho',
|
1516 |
+
'pizza',
|
1517 |
+
'pork chop',
|
1518 |
+
'poutine',
|
1519 |
+
'prime rib',
|
1520 |
+
'pulled pork sandwich',
|
1521 |
+
'ramen',
|
1522 |
+
'ravioli',
|
1523 |
+
'red velvet cake',
|
1524 |
+
'risotto',
|
1525 |
+
'samosa',
|
1526 |
+
'sashimi',
|
1527 |
+
'scallops',
|
1528 |
+
'seaweed salad',
|
1529 |
+
'shrimp and grits',
|
1530 |
+
'spaghetti bolognese',
|
1531 |
+
'spaghetti carbonara',
|
1532 |
+
'spring rolls',
|
1533 |
+
'steak',
|
1534 |
+
'strawberry shortcake',
|
1535 |
+
'sushi',
|
1536 |
+
'tacos',
|
1537 |
+
'takoyaki',
|
1538 |
+
'tiramisu',
|
1539 |
+
'tuna tartare',
|
1540 |
+
'waffles',
|
1541 |
+
]
|
1542 |
+
|
1543 |
+
templates = [
|
1544 |
+
'a photo of {}, a type of food.',
|
1545 |
+
]
|
1546 |
+
```
|
1547 |
+
|
1548 |
+
|
1549 |
+
|
1550 |
+
## GTSRB
|
1551 |
+
|
1552 |
+
```bash
|
1553 |
+
classes = [
|
1554 |
+
'red and white circle 20 kph speed limit',
|
1555 |
+
'red and white circle 30 kph speed limit',
|
1556 |
+
'red and white circle 50 kph speed limit',
|
1557 |
+
'red and white circle 60 kph speed limit',
|
1558 |
+
'red and white circle 70 kph speed limit',
|
1559 |
+
'red and white circle 80 kph speed limit',
|
1560 |
+
'end / de-restriction of 80 kph speed limit',
|
1561 |
+
'red and white circle 100 kph speed limit',
|
1562 |
+
'red and white circle 120 kph speed limit',
|
1563 |
+
'red and white circle red car and black car no passing',
|
1564 |
+
'red and white circle red truck and black car no passing',
|
1565 |
+
'red and white triangle road intersection warning',
|
1566 |
+
'white and yellow diamond priority road',
|
1567 |
+
'red and white upside down triangle yield right-of-way',
|
1568 |
+
'stop',
|
1569 |
+
'empty red and white circle',
|
1570 |
+
'red and white circle no truck entry',
|
1571 |
+
'red circle with white horizonal stripe no entry',
|
1572 |
+
'red and white triangle with exclamation mark warning',
|
1573 |
+
'red and white triangle with black left curve approaching warning',
|
1574 |
+
'red and white triangle with black right curve approaching warning',
|
1575 |
+
'red and white triangle with black double curve approaching warning',
|
1576 |
+
'red and white triangle rough / bumpy road warning',
|
1577 |
+
'red and white triangle car skidding / slipping warning',
|
1578 |
+
'red and white triangle with merging / narrow lanes warning',
|
1579 |
+
'red and white triangle with person digging / construction / road work warning',
|
1580 |
+
'red and white triangle with traffic light approaching warning',
|
1581 |
+
'red and white triangle with person walking warning',
|
1582 |
+
'red and white triangle with child and person walking warning',
|
1583 |
+
'red and white triangle with bicyle warning',
|
1584 |
+
'red and white triangle with snowflake / ice warning',
|
1585 |
+
'red and white triangle with deer warning',
|
1586 |
+
'white circle with gray strike bar no speed limit',
|
1587 |
+
'blue circle with white right turn arrow mandatory',
|
1588 |
+
'blue circle with white left turn arrow mandatory',
|
1589 |
+
'blue circle with white forward arrow mandatory',
|
1590 |
+
'blue circle with white forward or right turn arrow mandatory',
|
1591 |
+
'blue circle with white forward or left turn arrow mandatory',
|
1592 |
+
'blue circle with white keep right arrow mandatory',
|
1593 |
+
'blue circle with white keep left arrow mandatory',
|
1594 |
+
'blue circle with white arrows indicating a traffic circle',
|
1595 |
+
'white circle with gray strike bar indicating no passing for cars has ended',
|
1596 |
+
'white circle with gray strike bar indicating no passing for trucks has ended',
|
1597 |
+
]
|
1598 |
+
|
1599 |
+
templates = [
|
1600 |
+
'a zoomed in photo of a "{}" traffic sign.',
|
1601 |
+
'a centered photo of a "{}" traffic sign.',
|
1602 |
+
'a close up photo of a "{}" traffic sign.',
|
1603 |
+
]
|
1604 |
+
```
|
1605 |
+
|
1606 |
+
|
1607 |
+
|
1608 |
+
## HatefulMemes
|
1609 |
+
|
1610 |
+
```bash
|
1611 |
+
classes = [
|
1612 |
+
'meme',
|
1613 |
+
'hatespeech meme',
|
1614 |
+
]
|
1615 |
+
|
1616 |
+
templates = [
|
1617 |
+
'a {}.',
|
1618 |
+
]
|
1619 |
+
```
|
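
Since HatefulMemes is a two-class task with a single template, it is a compact place to show the full zero-shot prediction path end to end. The snippet below mirrors the usage pattern of the bundled CLIP package (encode the two prompts, encode an image, compare with a softmax over scaled cosine similarities); the model choice and the image path `meme.jpg` are placeholders for illustration, not part of the paper's evaluation setup.

```python
# End-to-end zero-shot sketch for the two HatefulMemes classes (illustration only;
# "meme.jpg" is a placeholder path and ViT-B/32 is an assumed model choice).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ['meme', 'hatespeech meme']
templates = ['a {}.']
prompts = [templates[0].format(c) for c in classes]

with torch.no_grad():
    text = model.encode_text(clip.tokenize(prompts).to(device)).float()
    text = text / text.norm(dim=-1, keepdim=True)
    image = preprocess(Image.open("meme.jpg")).unsqueeze(0).to(device)
    img = model.encode_image(image).float()
    img = img / img.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ text.T).softmax(dim=-1)  # probabilities for ['meme', 'hatespeech meme']
```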
1620 |
+
|
1621 |
+
|
1622 |
+
|
1623 |
+
## KITTI
|
1624 |
+
|
1625 |
+
```bash
|
1626 |
+
classes = [
|
1627 |
+
'a photo i took of a car on my left or right side.',
|
1628 |
+
'a photo i took with a car nearby.',
|
1629 |
+
'a photo i took with a car in the distance.',
|
1630 |
+
'a photo i took with no car.',
|
1631 |
+
]
|
1632 |
+
|
1633 |
+
templates = [
|
1634 |
+
'{}',
|
1635 |
+
]
|
1636 |
+
```
|
1637 |
+
|
1638 |
+
|
1639 |
+
|
1640 |
+
## Kinetics700
|
1641 |
+
|
1642 |
+
```bash
|
1643 |
+
classes = [
|
1644 |
+
'abseiling',
|
1645 |
+
'acting in play',
|
1646 |
+
'adjusting glasses',
|
1647 |
+
'air drumming',
|
1648 |
+
'alligator wrestling',
|
1649 |
+
'answering questions',
|
1650 |
+
'applauding',
|
1651 |
+
'applying cream',
|
1652 |
+
'archaeological excavation',
|
1653 |
+
'archery',
|
1654 |
+
'arguing',
|
1655 |
+
'arm wrestling',
|
1656 |
+
'arranging flowers',
|
1657 |
+
'arresting',
|
1658 |
+
'assembling bicycle',
|
1659 |
+
'assembling computer',
|
1660 |
+
'attending conference',
|
1661 |
+
'auctioning',
|
1662 |
+
'baby waking up',
|
1663 |
+
'backflip (human)',
|
1664 |
+
'baking cookies',
|
1665 |
+
'bandaging',
|
1666 |
+
'barbequing',
|
1667 |
+
'bartending',
|
1668 |
+
'base jumping',
|
1669 |
+
'bathing dog',
|
1670 |
+
'battle rope training',
|
1671 |
+
'beatboxing',
|
1672 |
+
'bee keeping',
|
1673 |
+
'being excited',
|
1674 |
+
'being in zero gravity',
|
1675 |
+
'belly dancing',
|
1676 |
+
'bench pressing',
|
1677 |
+
'bending back',
|
1678 |
+
'bending metal',
|
1679 |
+
'biking through snow',
|
1680 |
+
'blasting sand',
|
1681 |
+
'blending fruit',
|
1682 |
+
'blowdrying hair',
|
1683 |
+
'blowing bubble gum',
|
1684 |
+
'blowing glass',
|
1685 |
+
'blowing leaves',
|
1686 |
+
'blowing nose',
|
1687 |
+
'blowing out candles',
|
1688 |
+
'bobsledding',
|
1689 |
+
'bodysurfing',
|
1690 |
+
'bookbinding',
|
1691 |
+
'bottling',
|
1692 |
+
'bouncing ball (not juggling)',
|
1693 |
+
'bouncing on bouncy castle',
|
1694 |
+
'bouncing on trampoline',
|
1695 |
+
'bowling',
|
1696 |
+
'braiding hair',
|
1697 |
+
'breading or breadcrumbing',
|
1698 |
+
'breakdancing',
|
1699 |
+
'breaking boards',
|
1700 |
+
'breaking glass',
|
1701 |
+
'breathing fire',
|
1702 |
+
'brush painting',
|
1703 |
+
'brushing floor',
|
1704 |
+
'brushing hair',
|
1705 |
+
'brushing teeth',
|
1706 |
+
'building cabinet',
|
1707 |
+
'building lego',
|
1708 |
+
'building sandcastle',
|
1709 |
+
'building shed',
|
1710 |
+
'bulldozing',
|
1711 |
+
'bungee jumping',
|
1712 |
+
'burping',
|
1713 |
+
'busking',
|
1714 |
+
'calculating',
|
1715 |
+
'calligraphy',
|
1716 |
+
'canoeing or kayaking',
|
1717 |
+
'capoeira',
|
1718 |
+
'capsizing',
|
1719 |
+
'card stacking',
|
1720 |
+
'card throwing',
|
1721 |
+
'carrying baby',
|
1722 |
+
'carrying weight',
|
1723 |
+
'cartwheeling',
|
1724 |
+
'carving ice',
|
1725 |
+
'carving marble',
|
1726 |
+
'carving pumpkin',
|
1727 |
+
'carving wood with a knife',
|
1728 |
+
'casting fishing line',
|
1729 |
+
'catching fish',
|
1730 |
+
'catching or throwing baseball',
|
1731 |
+
'catching or throwing frisbee',
|
1732 |
+
'catching or throwing softball',
|
1733 |
+
'celebrating',
|
1734 |
+
'changing gear in car',
|
1735 |
+
'changing oil',
|
1736 |
+
'changing wheel (not on bike)',
|
1737 |
+
'chasing',
|
1738 |
+
'checking tires',
|
1739 |
+
'checking watch',
|
1740 |
+
'cheerleading',
|
1741 |
+
'chewing gum',
|
1742 |
+
'chiseling stone',
|
1743 |
+
'chiseling wood',
|
1744 |
+
'chopping meat',
|
1745 |
+
'chopping wood',
|
1746 |
+
'clam digging',
|
1747 |
+
'clapping',
|
1748 |
+
'clay pottery making',
|
1749 |
+
'clean and jerk',
|
1750 |
+
'cleaning gutters',
|
1751 |
+
'cleaning pool',
|
1752 |
+
'cleaning shoes',
|
1753 |
+
'cleaning toilet',
|
1754 |
+
'cleaning windows',
|
1755 |
+
'climbing a rope',
|
1756 |
+
'climbing ladder',
|
1757 |
+
'climbing tree',
|
1758 |
+
'closing door',
|
1759 |
+
'coloring in',
|
1760 |
+
'combing hair',
|
1761 |
+
'contact juggling',
|
1762 |
+
'contorting',
|
1763 |
+
'cooking chicken',
|
1764 |
+
'cooking egg',
|
1765 |
+
'cooking on campfire',
|
1766 |
+
'cooking sausages (not on barbeque)',
|
1767 |
+
'cooking scallops',
|
1768 |
+
'cosplaying',
|
1769 |
+
'coughing',
|
1770 |
+
'counting money',
|
1771 |
+
'country line dancing',
|
1772 |
+
'cracking back',
|
1773 |
+
'cracking knuckles',
|
1774 |
+
'cracking neck',
|
1775 |
+
'crawling baby',
|
1776 |
+
'crocheting',
|
1777 |
+
'crossing eyes',
|
1778 |
+
'crossing river',
|
1779 |
+
'crying',
|
1780 |
+
'cumbia',
|
1781 |
+
'curling (sport)',
|
1782 |
+
'curling eyelashes',
|
1783 |
+
'curling hair',
|
1784 |
+
'cutting apple',
|
1785 |
+
'cutting cake',
|
1786 |
+
'cutting nails',
|
1787 |
+
'cutting orange',
|
1788 |
+
'cutting pineapple',
|
1789 |
+
'cutting watermelon',
|
1790 |
+
'dancing ballet',
|
1791 |
+
'dancing charleston',
|
1792 |
+
'dancing gangnam style',
|
1793 |
+
'dancing macarena',
|
1794 |
+
'deadlifting',
|
1795 |
+
'dealing cards',
|
1796 |
+
'decorating the christmas tree',
|
1797 |
+
'decoupage',
|
1798 |
+
'delivering mail',
|
1799 |
+
'digging',
|
1800 |
+
'dining',
|
1801 |
+
'directing traffic',
|
1802 |
+
'disc golfing',
|
1803 |
+
'diving cliff',
|
1804 |
+
'docking boat',
|
1805 |
+
'dodgeball',
|
1806 |
+
'doing aerobics',
|
1807 |
+
'doing jigsaw puzzle',
|
1808 |
+
'doing laundry',
|
1809 |
+
'doing nails',
|
1810 |
+
'doing sudoku',
|
1811 |
+
'drawing',
|
1812 |
+
'dribbling basketball',
|
1813 |
+
'drinking shots',
|
1814 |
+
'driving car',
|
1815 |
+
'driving tractor',
|
1816 |
+
'drooling',
|
1817 |
+
'drop kicking',
|
1818 |
+
'drumming fingers',
|
1819 |
+
'dumpster diving',
|
1820 |
+
'dunking basketball',
|
1821 |
+
'dyeing eyebrows',
|
1822 |
+
'dyeing hair',
|
1823 |
+
'eating burger',
|
1824 |
+
'eating cake',
|
1825 |
+
'eating carrots',
|
1826 |
+
'eating chips',
|
1827 |
+
'eating doughnuts',
|
1828 |
+
'eating hotdog',
|
1829 |
+
'eating ice cream',
|
1830 |
+
'eating nachos',
|
1831 |
+
'eating spaghetti',
|
1832 |
+
'eating watermelon',
|
1833 |
+
'egg hunting',
|
1834 |
+
'embroidering',
|
1835 |
+
'entering church',
|
1836 |
+
'exercising arm',
|
1837 |
+
'exercising with an exercise ball',
|
1838 |
+
'extinguishing fire',
|
1839 |
+
'faceplanting',
|
1840 |
+
'falling off bike',
|
1841 |
+
'falling off chair',
|
1842 |
+
'feeding birds',
|
1843 |
+
'feeding fish',
|
1844 |
+
'feeding goats',
|
1845 |
+
'fencing (sport)',
|
1846 |
+
'fidgeting',
|
1847 |
+
'filling cake',
|
1848 |
+
'filling eyebrows',
|
1849 |
+
'finger snapping',
|
1850 |
+
'fixing bicycle',
|
1851 |
+
'fixing hair',
|
1852 |
+
'flint knapping',
|
1853 |
+
'flipping bottle',
|
1854 |
+
'flipping pancake',
|
1855 |
+
'fly tying',
|
1856 |
+
'flying kite',
|
1857 |
+
'folding clothes',
|
1858 |
+
'folding napkins',
|
1859 |
+
'folding paper',
|
1860 |
+
'front raises',
|
1861 |
+
'frying vegetables',
|
1862 |
+
'gargling',
|
1863 |
+
'geocaching',
|
1864 |
+
'getting a haircut',
|
1865 |
+
'getting a piercing',
|
1866 |
+
'getting a tattoo',
|
1867 |
+
'giving or receiving award',
|
1868 |
+
'gold panning',
|
1869 |
+
'golf chipping',
|
1870 |
+
'golf driving',
|
1871 |
+
'golf putting',
|
1872 |
+
'gospel singing in church',
|
1873 |
+
'grinding meat',
|
1874 |
+
'grooming cat',
|
1875 |
+
'grooming dog',
|
1876 |
+
'grooming horse',
|
1877 |
+
'gymnastics tumbling',
|
1878 |
+
'hammer throw',
|
1879 |
+
'hand washing clothes',
|
1880 |
+
'head stand',
|
1881 |
+
'headbanging',
|
1882 |
+
'headbutting',
|
1883 |
+
'helmet diving',
|
1884 |
+
'herding cattle',
|
1885 |
+
'high fiving',
|
1886 |
+
'high jump',
|
1887 |
+
'high kick',
|
1888 |
+
'historical reenactment',
|
1889 |
+
'hitting baseball',
|
1890 |
+
'hockey stop',
|
1891 |
+
'holding snake',
|
1892 |
+
'home roasting coffee',
|
1893 |
+
'hopscotch',
|
1894 |
+
'hoverboarding',
|
1895 |
+
'huddling',
|
1896 |
+
'hugging (not baby)',
|
1897 |
+
'hugging baby',
|
1898 |
+
'hula hooping',
|
1899 |
+
'hurdling',
|
1900 |
+
'hurling (sport)',
|
1901 |
+
'ice climbing',
|
1902 |
+
'ice fishing',
|
1903 |
+
'ice skating',
|
1904 |
+
'ice swimming',
|
1905 |
+
'inflating balloons',
|
1906 |
+
'installing carpet',
|
1907 |
+
'ironing',
|
1908 |
+
'ironing hair',
|
1909 |
+
'javelin throw',
|
1910 |
+
'jaywalking',
|
1911 |
+
'jetskiing',
|
1912 |
+
'jogging',
|
1913 |
+
'juggling balls',
|
1914 |
+
'juggling fire',
|
1915 |
+
'juggling soccer ball',
|
1916 |
+
'jumping bicycle',
|
1917 |
+
'jumping into pool',
|
1918 |
+
'jumping jacks',
|
1919 |
+
'jumping sofa',
|
1920 |
+
'jumpstyle dancing',
|
1921 |
+
'karaoke',
|
1922 |
+
'kicking field goal',
|
1923 |
+
'kicking soccer ball',
|
1924 |
+
'kissing',
|
1925 |
+
'kitesurfing',
|
1926 |
+
'knitting',
|
1927 |
+
'krumping',
|
1928 |
+
'land sailing',
|
1929 |
+
'laughing',
|
1930 |
+
'lawn mower racing',
|
1931 |
+
'laying bricks',
|
1932 |
+
'laying concrete',
|
1933 |
+
'laying decking',
|
1934 |
+
'laying stone',
|
1935 |
+
'laying tiles',
|
1936 |
+
'leatherworking',
|
1937 |
+
'letting go of balloon',
|
1938 |
+
'licking',
|
1939 |
+
'lifting hat',
|
1940 |
+
'lighting candle',
|
1941 |
+
'lighting fire',
|
1942 |
+
'listening with headphones',
|
1943 |
+
'lock picking',
|
1944 |
+
'long jump',
|
1945 |
+
'longboarding',
|
1946 |
+
'looking at phone',
|
1947 |
+
'looking in mirror',
|
1948 |
+
'luge',
|
1949 |
+
'lunge',
|
1950 |
+
'making a cake',
|
1951 |
+
'making a sandwich',
|
1952 |
+
'making balloon shapes',
|
1953 |
+
'making bubbles',
|
1954 |
+
'making cheese',
|
1955 |
+
'making horseshoes',
|
1956 |
+
'making jewelry',
|
1957 |
+
'making latte art',
|
1958 |
+
'making paper aeroplanes',
|
1959 |
+
'making pizza',
|
1960 |
+
'making slime',
|
1961 |
+
'making snowman',
|
1962 |
+
'making sushi',
|
1963 |
+
'making tea',
|
1964 |
+
'making the bed',
|
1965 |
+
'marching',
|
1966 |
+
'marriage proposal',
|
1967 |
+
'massaging back',
|
1968 |
+
'massaging feet',
|
1969 |
+
'massaging legs',
|
1970 |
+
'massaging neck',
|
1971 |
+
"massaging person's head",
|
1972 |
+
'metal detecting',
|
1973 |
+
'milking cow',
|
1974 |
+
'milking goat',
|
1975 |
+
'mixing colours',
|
1976 |
+
'moon walking',
|
1977 |
+
'mopping floor',
|
1978 |
+
'mosh pit dancing',
|
1979 |
+
'motorcycling',
|
1980 |
+
'mountain climber (exercise)',
|
1981 |
+
'moving baby',
|
1982 |
+
'moving child',
|
1983 |
+
'moving furniture',
|
1984 |
+
'mowing lawn',
|
1985 |
+
'mushroom foraging',
|
1986 |
+
'needle felting',
|
1987 |
+
'news anchoring',
|
1988 |
+
'opening bottle (not wine)',
|
1989 |
+
'opening coconuts',
|
1990 |
+
'opening door',
|
1991 |
+
'opening present',
|
1992 |
+
'opening refrigerator',
|
1993 |
+
'opening wine bottle',
|
1994 |
+
'packing',
|
1995 |
+
'paragliding',
|
1996 |
+
'parasailing',
|
1997 |
+
'parkour',
|
1998 |
+
'passing American football (in game)',
|
1999 |
+
'passing American football (not in game)',
|
2000 |
+
'passing soccer ball',
|
2001 |
+
'peeling apples',
|
2002 |
+
'peeling banana',
|
2003 |
+
'peeling potatoes',
|
2004 |
+
'person collecting garbage',
|
2005 |
+
'petting animal (not cat)',
|
2006 |
+
'petting cat',
|
2007 |
+
'petting horse',
|
2008 |
+
'photobombing',
|
2009 |
+
'photocopying',
|
2010 |
+
'picking apples',
|
2011 |
+
'picking blueberries',
|
2012 |
+
'pillow fight',
|
2013 |
+
'pinching',
|
2014 |
+
'pirouetting',
|
2015 |
+
'planing wood',
|
2016 |
+
'planting trees',
|
2017 |
+
'plastering',
|
2018 |
+
'playing accordion',
|
2019 |
+
'playing american football',
|
2020 |
+
'playing badminton',
|
2021 |
+
'playing bagpipes',
|
2022 |
+
'playing basketball',
|
2023 |
+
'playing bass guitar',
|
2024 |
+
'playing beer pong',
|
2025 |
+
'playing billiards',
|
2026 |
+
'playing blackjack',
|
2027 |
+
'playing cards',
|
2028 |
+
'playing cello',
|
2029 |
+
'playing checkers',
|
2030 |
+
'playing chess',
|
2031 |
+
'playing clarinet',
|
2032 |
+
'playing controller',
|
2033 |
+
'playing cricket',
|
2034 |
+
'playing cymbals',
|
2035 |
+
'playing darts',
|
2036 |
+
'playing didgeridoo',
|
2037 |
+
'playing dominoes',
|
2038 |
+
'playing drums',
|
2039 |
+
'playing field hockey',
|
2040 |
+
'playing flute',
|
2041 |
+
'playing gong',
|
2042 |
+
'playing guitar',
|
2043 |
+
'playing hand clapping games',
|
2044 |
+
'playing harmonica',
|
2045 |
+
'playing harp',
|
2046 |
+
'playing ice hockey',
|
2047 |
+
'playing keyboard',
|
2048 |
+
'playing kickball',
|
2049 |
+
'playing laser tag',
|
2050 |
+
'playing lute',
|
2051 |
+
'playing mahjong',
|
2052 |
+
'playing maracas',
|
2053 |
+
'playing marbles',
|
2054 |
+
'playing monopoly',
|
2055 |
+
'playing netball',
|
2056 |
+
'playing nose flute',
|
2057 |
+
'playing oboe',
|
2058 |
+
'playing ocarina',
|
2059 |
+
'playing organ',
|
2060 |
+
'playing paintball',
|
2061 |
+
'playing pan pipes',
|
2062 |
+
'playing piano',
|
2063 |
+
'playing piccolo',
|
2064 |
+
'playing pinball',
|
2065 |
+
'playing ping pong',
|
2066 |
+
'playing poker',
|
2067 |
+
'playing polo',
|
2068 |
+
'playing recorder',
|
2069 |
+
'playing road hockey',
|
2070 |
+
'playing rounders',
|
2071 |
+
'playing rubiks cube',
|
2072 |
+
'playing saxophone',
|
2073 |
+
'playing scrabble',
|
2074 |
+
'playing shuffleboard',
|
2075 |
+
'playing slot machine',
|
2076 |
+
'playing squash or racquetball',
|
2077 |
+
'playing tennis',
|
2078 |
+
'playing trombone',
|
2079 |
+
'playing trumpet',
|
2080 |
+
'playing ukulele',
|
2081 |
+
'playing violin',
|
2082 |
+
'playing volleyball',
|
2083 |
+
'playing with trains',
|
2084 |
+
'playing xylophone',
|
2085 |
+
'poaching eggs',
|
2086 |
+
'poking bellybutton',
|
2087 |
+
'pole vault',
|
2088 |
+
'polishing furniture',
|
2089 |
+
'polishing metal',
|
2090 |
+
'popping balloons',
|
2091 |
+
'pouring beer',
|
2092 |
+
'pouring milk',
|
2093 |
+
'pouring wine',
|
2094 |
+
'preparing salad',
|
2095 |
+
'presenting weather forecast',
|
2096 |
+
'pretending to be a statue',
|
2097 |
+
'pull ups',
|
2098 |
+
'pulling espresso shot',
|
2099 |
+
'pulling rope (game)',
|
2100 |
+
'pumping fist',
|
2101 |
+
'pumping gas',
|
2102 |
+
'punching bag',
|
2103 |
+
'punching person (boxing)',
|
2104 |
+
'push up',
|
2105 |
+
'pushing car',
|
2106 |
+
'pushing cart',
|
2107 |
+
'pushing wheelbarrow',
|
2108 |
+
'pushing wheelchair',
|
2109 |
+
'putting in contact lenses',
|
2110 |
+
'putting on eyeliner',
|
2111 |
+
'putting on foundation',
|
2112 |
+
'putting on lipstick',
|
2113 |
+
'putting on mascara',
|
2114 |
+
'putting on sari',
|
2115 |
+
'putting on shoes',
|
2116 |
+
'putting wallpaper on wall',
|
2117 |
+
'raising eyebrows',
|
2118 |
+
'reading book',
|
2119 |
+
'reading newspaper',
|
2120 |
+
'recording music',
|
2121 |
+
'repairing puncture',
|
2122 |
+
'riding a bike',
|
2123 |
+
'riding camel',
|
2124 |
+
'riding elephant',
|
2125 |
+
'riding mechanical bull',
|
2126 |
+
'riding mule',
|
2127 |
+
'riding or walking with horse',
|
2128 |
+
'riding scooter',
|
2129 |
+
'riding snow blower',
|
2130 |
+
'riding unicycle',
|
2131 |
+
'ripping paper',
|
2132 |
+
'roasting marshmallows',
|
2133 |
+
'roasting pig',
|
2134 |
+
'robot dancing',
|
2135 |
+
'rock climbing',
|
2136 |
+
'rock scissors paper',
|
2137 |
+
'roller skating',
|
2138 |
+
'rolling eyes',
|
2139 |
+
'rolling pastry',
|
2140 |
+
'rope pushdown',
|
2141 |
+
'running on treadmill',
|
2142 |
+
'sailing',
|
2143 |
+
'salsa dancing',
|
2144 |
+
'saluting',
|
2145 |
+
'sanding floor',
|
2146 |
+
'sanding wood',
|
2147 |
+
'sausage making',
|
2148 |
+
'sawing wood',
|
2149 |
+
'scrambling eggs',
|
2150 |
+
'scrapbooking',
|
2151 |
+
'scrubbing face',
|
2152 |
+
'scuba diving',
|
2153 |
+
'seasoning food',
|
2154 |
+
'separating eggs',
|
2155 |
+
'setting table',
|
2156 |
+
'sewing',
|
2157 |
+
'shaking hands',
|
2158 |
+
'shaking head',
|
2159 |
+
'shaping bread dough',
|
2160 |
+
'sharpening knives',
|
2161 |
+
'sharpening pencil',
|
2162 |
+
'shaving head',
|
2163 |
+
'shaving legs',
|
2164 |
+
'shearing sheep',
|
2165 |
+
'shining flashlight',
|
2166 |
+
'shining shoes',
|
2167 |
+
'shoot dance',
|
2168 |
+
'shooting basketball',
|
2169 |
+
'shooting goal (soccer)',
|
2170 |
+
'shooting off fireworks',
|
2171 |
+
'shopping',
|
2172 |
+
'shot put',
|
2173 |
+
'shouting',
|
2174 |
+
'shoveling snow',
|
2175 |
+
'shredding paper',
|
2176 |
+
'shucking oysters',
|
2177 |
+
'shuffling cards',
|
2178 |
+
'shuffling feet',
|
2179 |
+
'side kick',
|
2180 |
+
'sieving',
|
2181 |
+
'sign language interpreting',
|
2182 |
+
'silent disco',
|
2183 |
+
'singing',
|
2184 |
+
'sipping cup',
|
2185 |
+
'situp',
|
2186 |
+
'skateboarding',
|
2187 |
+
'ski ballet',
|
2188 |
+
'ski jumping',
|
2189 |
+
'skiing crosscountry',
|
2190 |
+
'skiing mono',
|
2191 |
+
'skiing slalom',
|
2192 |
+
'skipping rope',
|
2193 |
+
'skipping stone',
|
2194 |
+
'skydiving',
|
2195 |
+
'slacklining',
|
2196 |
+
'slapping',
|
2197 |
+
'sled dog racing',
|
2198 |
+
'sleeping',
|
2199 |
+
'slicing onion',
|
2200 |
+
'smashing',
|
2201 |
+
'smelling feet',
|
2202 |
+
'smoking',
|
2203 |
+
'smoking hookah',
|
2204 |
+
'smoking pipe',
|
2205 |
+
'snatch weight lifting',
|
2206 |
+
'sneezing',
|
2207 |
+
'snorkeling',
|
2208 |
+
'snowboarding',
|
2209 |
+
'snowkiting',
|
2210 |
+
'snowmobiling',
|
2211 |
+
'somersaulting',
|
2212 |
+
'spelunking',
|
2213 |
+
'spinning plates',
|
2214 |
+
'spinning poi',
|
2215 |
+
'splashing water',
|
2216 |
+
'spray painting',
|
2217 |
+
'spraying',
|
2218 |
+
'springboard diving',
|
2219 |
+
'square dancing',
|
2220 |
+
'squat',
|
2221 |
+
'squeezing orange',
|
2222 |
+
'stacking cups',
|
2223 |
+
'stacking dice',
|
2224 |
+
'standing on hands',
|
2225 |
+
'staring',
|
2226 |
+
'steer roping',
|
2227 |
+
'steering car',
|
2228 |
+
'sticking tongue out',
|
2229 |
+
'stomping grapes',
|
2230 |
+
'stretching arm',
|
2231 |
+
'stretching leg',
|
2232 |
+
'sucking lolly',
|
2233 |
+
'surfing crowd',
|
2234 |
+
'surfing water',
|
2235 |
+
'surveying',
|
2236 |
+
'sweeping floor',
|
2237 |
+
'swimming backstroke',
|
2238 |
+
'swimming breast stroke',
|
2239 |
+
'swimming butterfly stroke',
|
2240 |
+
'swimming front crawl',
|
2241 |
+
'swimming with dolphins',
|
2242 |
+
'swimming with sharks',
|
2243 |
+
'swing dancing',
|
2244 |
+
'swinging baseball bat',
|
2245 |
+
'swinging on something',
|
2246 |
+
'sword fighting',
|
2247 |
+
'sword swallowing',
|
2248 |
+
'tackling',
|
2249 |
+
'tagging graffiti',
|
2250 |
+
'tai chi',
|
2251 |
+
'taking photo',
|
2252 |
+
'talking on cell phone',
|
2253 |
+
'tango dancing',
|
2254 |
+
'tap dancing',
|
2255 |
+
'tapping guitar',
|
2256 |
+
'tapping pen',
|
2257 |
+
'tasting beer',
|
2258 |
+
'tasting food',
|
2259 |
+
'tasting wine',
|
2260 |
+
'testifying',
|
2261 |
+
'texting',
|
2262 |
+
'threading needle',
|
2263 |
+
'throwing axe',
|
2264 |
+
'throwing ball (not baseball or American football)',
|
2265 |
+
'throwing discus',
|
2266 |
+
'throwing knife',
|
2267 |
+
'throwing snowballs',
|
2268 |
+
'throwing tantrum',
|
2269 |
+
'throwing water balloon',
|
2270 |
+
'tickling',
|
2271 |
+
'tie dying',
|
2272 |
+
'tightrope walking',
|
2273 |
+
'tiptoeing',
|
2274 |
+
'tobogganing',
|
2275 |
+
'tossing coin',
|
2276 |
+
'tossing salad',
|
2277 |
+
'training dog',
|
2278 |
+
'trapezing',
|
2279 |
+
'treating wood',
|
2280 |
+
'trimming or shaving beard',
|
2281 |
+
'trimming shrubs',
|
2282 |
+
'trimming trees',
|
2283 |
+
'triple jump',
|
2284 |
+
'twiddling fingers',
|
2285 |
+
'tying bow tie',
|
2286 |
+
'tying knot (not on a tie)',
|
2287 |
+
'tying necktie',
|
2288 |
+
'tying shoe laces',
|
2289 |
+
'unboxing',
|
2290 |
+
'uncorking champagne',
|
2291 |
+
'unloading truck',
|
2292 |
+
'using a microscope',
|
2293 |
+
'using a paint roller',
|
2294 |
+
'using a power drill',
|
2295 |
+
'using a sledge hammer',
|
2296 |
+
'using a wrench',
|
2297 |
+
'using atm',
|
2298 |
+
'using bagging machine',
|
2299 |
+
'using circular saw',
|
2300 |
+
'using inhaler',
|
2301 |
+
'using megaphone',
|
2302 |
+
'using puppets',
|
2303 |
+
'using remote controller (not gaming)',
|
2304 |
+
'using segway',
|
2305 |
+
'vacuuming car',
|
2306 |
+
'vacuuming floor',
|
2307 |
+
'visiting the zoo',
|
2308 |
+
'wading through mud',
|
2309 |
+
'wading through water',
|
2310 |
+
'waiting in line',
|
2311 |
+
'waking up',
|
2312 |
+
'walking on stilts',
|
2313 |
+
'walking the dog',
|
2314 |
+
'walking through snow',
|
2315 |
+
'walking with crutches',
|
2316 |
+
'washing dishes',
|
2317 |
+
'washing feet',
|
2318 |
+
'washing hair',
|
2319 |
+
'washing hands',
|
2320 |
+
'watching tv',
|
2321 |
+
'water skiing',
|
2322 |
+
'water sliding',
|
2323 |
+
'watering plants',
|
2324 |
+
'waving hand',
|
2325 |
+
'waxing armpits',
|
2326 |
+
'waxing back',
|
2327 |
+
'waxing chest',
|
2328 |
+
'waxing eyebrows',
|
2329 |
+
'waxing legs',
|
2330 |
+
'weaving basket',
|
2331 |
+
'weaving fabric',
|
2332 |
+
'welding',
|
2333 |
+
'whistling',
|
2334 |
+
'windsurfing',
|
2335 |
+
'winking',
|
2336 |
+
'wood burning (art)',
|
2337 |
+
'wrapping present',
|
2338 |
+
'wrestling',
|
2339 |
+
'writing',
|
2340 |
+
'yarn spinning',
|
2341 |
+
'yawning',
|
2342 |
+
'yoga',
|
2343 |
+
'zumba'
|
2344 |
+
]

templates = [
'a photo of {}.',
'a photo of a person {}.',
'a photo of a person using {}.',
'a photo of a person doing {}.',
'a photo of a person during {}.',
'a photo of a person performing {}.',
'a photo of a person practicing {}.',
'a video of {}.',
'a video of a person {}.',
'a video of a person using {}.',
'a video of a person doing {}.',
'a video of a person during {}.',
'a video of a person performing {}.',
'a video of a person practicing {}.',
'a example of {}.',
'a example of a person {}.',
'a example of a person using {}.',
'a example of a person doing {}.',
'a example of a person during {}.',
'a example of a person performing {}.',
'a example of a person practicing {}.',
'a demonstration of {}.',
'a demonstration of a person {}.',
'a demonstration of a person using {}.',
'a demonstration of a person doing {}.',
'a demonstration of a person during {}.',
'a demonstration of a person performing {}.',
'a demonstration of a person practicing {}.',
]
```

## MNIST

```bash
classes = [
'0',
'1',
'2',
'3',
'4',
'5',
'6',
'7',
'8',
'9',
]

templates = [
'a photo of the number: "{}".',
]
```

## OxfordPets

```bash
classes = [
'Abyssinian',
'Bengal',
'Birman',
'Bombay',
'British Shorthair',
'Egyptian Mau',
'Maine Coon',
'Persian',
'Ragdoll',
'Russian Blue',
'Siamese',
'Sphynx',
'american bulldog',
'american pit bull terrier',
'basset hound',
'beagle',
'boxer',
'chihuahua',
'english cocker spaniel',
'english setter',
'german shorthaired',
'great pyrenees',
'havanese',
'japanese chin',
'keeshond',
'leonberger',
'miniature pinscher',
'newfoundland',
'pomeranian',
'pug',
'saint bernard',
'samoyed',
'scottish terrier',
'shiba inu',
'staffordshire bull terrier',
'wheaten terrier',
'yorkshire terrier',
]

templates = [
'a photo of a {}, a type of pet.',
]
```

## PascalVOC2007

```bash
classes = [
'aeroplane',
'bicycle',
'bird',
'boat',
'bottle',
'bus',
'car',
'cat',
'chair',
'cow',
'dog',
'horse',
'motorbike',
'person',
'sheep',
'sofa',
'diningtable',
'pottedplant',
'train',
'tvmonitor',
]

templates = [
'a photo of a {}.',
]
```

## PatchCamelyon

```bash
classes = [
'lymph node',
'lymph node containing metastatic tumor tissue',
]

templates = [
'this is a photo of {}',
]
```

## RESISC45

```bash
classes = [
'airplane',
'airport',
'baseball diamond',
'basketball court',
'beach',
'bridge',
'chaparral',
'church',
'circular farmland',
'cloud',
'commercial area',
'dense residential',
'desert',
'forest',
'freeway',
'golf course',
'ground track field',
'harbor',
'industrial area',
'intersection',
'island',
'lake',
'meadow',
'medium residential',
'mobile home park',
'mountain',
'overpass',
'palace',
'parking lot',
'railway',
'railway station',
'rectangular farmland',
'river',
'roundabout',
'runway',
'sea ice',
'ship',
'snowberg',
'sparse residential',
'stadium',
'storage tank',
'tennis court',
'terrace',
'thermal power station',
'wetland',
]

templates = [
'satellite imagery of {}.',
'aerial imagery of {}.',
'satellite photo of {}.',
'aerial photo of {}.',
'satellite view of {}.',
'aerial view of {}.',
'satellite imagery of a {}.',
'aerial imagery of a {}.',
'satellite photo of a {}.',
'aerial photo of a {}.',
'satellite view of a {}.',
'aerial view of a {}.',
'satellite imagery of the {}.',
'aerial imagery of the {}.',
'satellite photo of the {}.',
'aerial photo of the {}.',
'satellite view of the {}.',
'aerial view of the {}.',
]
```

## SST2

```bash
classes = [
'negative',
'positive',
]

templates = [
'a {} review of a movie.',
]
```

## STL10

```bash
classes = [
'airplane',
'bird',
'car',
'cat',
'deer',
'dog',
'horse',
'monkey',
'ship',
'truck',
]

templates = [
'a photo of a {}.',
'a photo of the {}.',
]
```

## SUN397
|
2616 |
+
|
2617 |
+
```bash
|
2618 |
+
classes = [
|
2619 |
+
'abbey',
|
2620 |
+
'airplane cabin',
|
2621 |
+
'airport terminal',
|
2622 |
+
'alley',
|
2623 |
+
'amphitheater',
|
2624 |
+
'amusement arcade',
|
2625 |
+
'amusement park',
|
2626 |
+
'anechoic chamber',
|
2627 |
+
'apartment building outdoor',
|
2628 |
+
'apse indoor',
|
2629 |
+
'aquarium',
|
2630 |
+
'aqueduct',
|
2631 |
+
'arch',
|
2632 |
+
'archive',
|
2633 |
+
'arrival gate outdoor',
|
2634 |
+
'art gallery',
|
2635 |
+
'art school',
|
2636 |
+
'art studio',
|
2637 |
+
'assembly line',
|
2638 |
+
'athletic field outdoor',
|
2639 |
+
'atrium public',
|
2640 |
+
'attic',
|
2641 |
+
'auditorium',
|
2642 |
+
'auto factory',
|
2643 |
+
'badlands',
|
2644 |
+
'badminton court indoor',
|
2645 |
+
'baggage claim',
|
2646 |
+
'bakery shop',
|
2647 |
+
'balcony exterior',
|
2648 |
+
'balcony interior',
|
2649 |
+
'ball pit',
|
2650 |
+
'ballroom',
|
2651 |
+
'bamboo forest',
|
2652 |
+
'banquet hall',
|
2653 |
+
'bar',
|
2654 |
+
'barn',
|
2655 |
+
'barndoor',
|
2656 |
+
'baseball field',
|
2657 |
+
'basement',
|
2658 |
+
'basilica',
|
2659 |
+
'basketball court outdoor',
|
2660 |
+
'bathroom',
|
2661 |
+
'batters box',
|
2662 |
+
'bayou',
|
2663 |
+
'bazaar indoor',
|
2664 |
+
'bazaar outdoor',
|
2665 |
+
'beach',
|
2666 |
+
'beauty salon',
|
2667 |
+
'bedroom',
|
2668 |
+
'berth',
|
2669 |
+
'biology laboratory',
|
2670 |
+
'bistro indoor',
|
2671 |
+
'boardwalk',
|
2672 |
+
'boat deck',
|
2673 |
+
'boathouse',
|
2674 |
+
'bookstore',
|
2675 |
+
'booth indoor',
|
2676 |
+
'botanical garden',
|
2677 |
+
'bow window indoor',
|
2678 |
+
'bow window outdoor',
|
2679 |
+
'bowling alley',
|
2680 |
+
'boxing ring',
|
2681 |
+
'brewery indoor',
|
2682 |
+
'bridge',
|
2683 |
+
'building facade',
|
2684 |
+
'bullring',
|
2685 |
+
'burial chamber',
|
2686 |
+
'bus interior',
|
2687 |
+
'butchers shop',
|
2688 |
+
'butte',
|
2689 |
+
'cabin outdoor',
|
2690 |
+
'cafeteria',
|
2691 |
+
'campsite',
|
2692 |
+
'campus',
|
2693 |
+
'canal natural',
|
2694 |
+
'canal urban',
|
2695 |
+
'candy store',
|
2696 |
+
'canyon',
|
2697 |
+
'car interior backseat',
|
2698 |
+
'car interior frontseat',
|
2699 |
+
'carrousel',
|
2700 |
+
'casino indoor',
|
2701 |
+
'castle',
|
2702 |
+
'catacomb',
|
2703 |
+
'cathedral indoor',
|
2704 |
+
'cathedral outdoor',
|
2705 |
+
'cavern indoor',
|
2706 |
+
'cemetery',
|
2707 |
+
'chalet',
|
2708 |
+
'cheese factory',
|
2709 |
+
'chemistry lab',
|
2710 |
+
'chicken coop indoor',
|
2711 |
+
'chicken coop outdoor',
|
2712 |
+
'childs room',
|
2713 |
+
'church indoor',
|
2714 |
+
'church outdoor',
|
2715 |
+
'classroom',
|
2716 |
+
'clean room',
|
2717 |
+
'cliff',
|
2718 |
+
'cloister indoor',
|
2719 |
+
'closet',
|
2720 |
+
'clothing store',
|
2721 |
+
'coast',
|
2722 |
+
'cockpit',
|
2723 |
+
'coffee shop',
|
2724 |
+
'computer room',
|
2725 |
+
'conference center',
|
2726 |
+
'conference room',
|
2727 |
+
'construction site',
|
2728 |
+
'control room',
|
2729 |
+
'control tower outdoor',
|
2730 |
+
'corn field',
|
2731 |
+
'corral',
|
2732 |
+
'corridor',
|
2733 |
+
'cottage garden',
|
2734 |
+
'courthouse',
|
2735 |
+
'courtroom',
|
2736 |
+
'courtyard',
|
2737 |
+
'covered bridge exterior',
|
2738 |
+
'creek',
|
2739 |
+
'crevasse',
|
2740 |
+
'crosswalk',
|
2741 |
+
'cubicle office',
|
2742 |
+
'dam',
|
2743 |
+
'delicatessen',
|
2744 |
+
'dentists office',
|
2745 |
+
'desert sand',
|
2746 |
+
'desert vegetation',
|
2747 |
+
'diner indoor',
|
2748 |
+
'diner outdoor',
|
2749 |
+
'dinette home',
|
2750 |
+
'dinette vehicle',
|
2751 |
+
'dining car',
|
2752 |
+
'dining room',
|
2753 |
+
'discotheque',
|
2754 |
+
'dock',
|
2755 |
+
'doorway outdoor',
|
2756 |
+
'dorm room',
|
2757 |
+
'driveway',
|
2758 |
+
'driving range outdoor',
|
2759 |
+
'drugstore',
|
2760 |
+
'electrical substation',
|
2761 |
+
'elevator door',
|
2762 |
+
'elevator interior',
|
2763 |
+
'elevator shaft',
|
2764 |
+
'engine room',
|
2765 |
+
'escalator indoor',
|
2766 |
+
'excavation',
|
2767 |
+
'factory indoor',
|
2768 |
+
'fairway',
|
2769 |
+
'fastfood restaurant',
|
2770 |
+
'field cultivated',
|
2771 |
+
'field wild',
|
2772 |
+
'fire escape',
|
2773 |
+
'fire station',
|
2774 |
+
'firing range indoor',
|
2775 |
+
'fishpond',
|
2776 |
+
'florist shop indoor',
|
2777 |
+
'food court',
|
2778 |
+
'forest broadleaf',
|
2779 |
+
'forest needleleaf',
|
2780 |
+
'forest path',
|
2781 |
+
'forest road',
|
2782 |
+
'formal garden',
|
2783 |
+
'fountain',
|
2784 |
+
'galley',
|
2785 |
+
'game room',
|
2786 |
+
'garage indoor',
|
2787 |
+
'garbage dump',
|
2788 |
+
'gas station',
|
2789 |
+
'gazebo exterior',
|
2790 |
+
'general store indoor',
|
2791 |
+
'general store outdoor',
|
2792 |
+
'gift shop',
|
2793 |
+
'golf course',
|
2794 |
+
'greenhouse indoor',
|
2795 |
+
'greenhouse outdoor',
|
2796 |
+
'gymnasium indoor',
|
2797 |
+
'hangar indoor',
|
2798 |
+
'hangar outdoor',
|
2799 |
+
'harbor',
|
2800 |
+
'hayfield',
|
2801 |
+
'heliport',
|
2802 |
+
'herb garden',
|
2803 |
+
'highway',
|
2804 |
+
'hill',
|
2805 |
+
'home office',
|
2806 |
+
'hospital',
|
2807 |
+
'hospital room',
|
2808 |
+
'hot spring',
|
2809 |
+
'hot tub outdoor',
|
2810 |
+
'hotel outdoor',
|
2811 |
+
'hotel room',
|
2812 |
+
'house',
|
2813 |
+
'hunting lodge outdoor',
|
2814 |
+
'ice cream parlor',
|
2815 |
+
'ice floe',
|
2816 |
+
'ice shelf',
|
2817 |
+
'ice skating rink indoor',
|
2818 |
+
'ice skating rink outdoor',
|
2819 |
+
'iceberg',
|
2820 |
+
'igloo',
|
2821 |
+
'industrial area',
|
2822 |
+
'inn outdoor',
|
2823 |
+
'islet',
|
2824 |
+
'jacuzzi indoor',
|
2825 |
+
'jail cell',
|
2826 |
+
'jail indoor',
|
2827 |
+
'jewelry shop',
|
2828 |
+
'kasbah',
|
2829 |
+
'kennel indoor',
|
2830 |
+
'kennel outdoor',
|
2831 |
+
'kindergarden classroom',
|
2832 |
+
'kitchen',
|
2833 |
+
'kitchenette',
|
2834 |
+
'labyrinth outdoor',
|
2835 |
+
'lake natural',
|
2836 |
+
'landfill',
|
2837 |
+
'landing deck',
|
2838 |
+
'laundromat',
|
2839 |
+
'lecture room',
|
2840 |
+
'library indoor',
|
2841 |
+
'library outdoor',
|
2842 |
+
'lido deck outdoor',
|
2843 |
+
'lift bridge',
|
2844 |
+
'lighthouse',
|
2845 |
+
'limousine interior',
|
2846 |
+
'living room',
|
2847 |
+
'lobby',
|
2848 |
+
'lock chamber',
|
2849 |
+
'locker room',
|
2850 |
+
'mansion',
|
2851 |
+
'manufactured home',
|
2852 |
+
'market indoor',
|
2853 |
+
'market outdoor',
|
2854 |
+
'marsh',
|
2855 |
+
'martial arts gym',
|
2856 |
+
'mausoleum',
|
2857 |
+
'medina',
|
2858 |
+
'moat water',
|
2859 |
+
'monastery outdoor',
|
2860 |
+
'mosque indoor',
|
2861 |
+
'mosque outdoor',
|
2862 |
+
'motel',
|
2863 |
+
'mountain',
|
2864 |
+
'mountain snowy',
|
2865 |
+
'movie theater indoor',
|
2866 |
+
'museum indoor',
|
2867 |
+
'music store',
|
2868 |
+
'music studio',
|
2869 |
+
'nuclear power plant outdoor',
|
2870 |
+
'nursery',
|
2871 |
+
'oast house',
|
2872 |
+
'observatory outdoor',
|
2873 |
+
'ocean',
|
2874 |
+
'office',
|
2875 |
+
'office building',
|
2876 |
+
'oil refinery outdoor',
|
2877 |
+
'oilrig',
|
2878 |
+
'operating room',
|
2879 |
+
'orchard',
|
2880 |
+
'outhouse outdoor',
|
2881 |
+
'pagoda',
|
2882 |
+
'palace',
|
2883 |
+
'pantry',
|
2884 |
+
'park',
|
2885 |
+
'parking garage indoor',
|
2886 |
+
'parking garage outdoor',
|
2887 |
+
'parking lot',
|
2888 |
+
'parlor',
|
2889 |
+
'pasture',
|
2890 |
+
'patio',
|
2891 |
+
'pavilion',
|
2892 |
+
'pharmacy',
|
2893 |
+
'phone booth',
|
2894 |
+
'physics laboratory',
|
2895 |
+
'picnic area',
|
2896 |
+
'pilothouse indoor',
|
2897 |
+
'planetarium outdoor',
|
2898 |
+
'playground',
|
2899 |
+
'playroom',
|
2900 |
+
'plaza',
|
2901 |
+
'podium indoor',
|
2902 |
+
'podium outdoor',
|
2903 |
+
'pond',
|
2904 |
+
'poolroom establishment',
|
2905 |
+
'poolroom home',
|
2906 |
+
'power plant outdoor',
|
2907 |
+
'promenade deck',
|
2908 |
+
'pub indoor',
|
2909 |
+
'pulpit',
|
2910 |
+
'putting green',
|
2911 |
+
'racecourse',
|
2912 |
+
'raceway',
|
2913 |
+
'raft',
|
2914 |
+
'railroad track',
|
2915 |
+
'rainforest',
|
2916 |
+
'reception',
|
2917 |
+
'recreation room',
|
2918 |
+
'residential neighborhood',
|
2919 |
+
'restaurant',
|
2920 |
+
'restaurant kitchen',
|
2921 |
+
'restaurant patio',
|
2922 |
+
'rice paddy',
|
2923 |
+
'riding arena',
|
2924 |
+
'river',
|
2925 |
+
'rock arch',
|
2926 |
+
'rope bridge',
|
2927 |
+
'ruin',
|
2928 |
+
'runway',
|
2929 |
+
'sandbar',
|
2930 |
+
'sandbox',
|
2931 |
+
'sauna',
|
2932 |
+
'schoolhouse',
|
2933 |
+
'sea cliff',
|
2934 |
+
'server room',
|
2935 |
+
'shed',
|
2936 |
+
'shoe shop',
|
2937 |
+
'shopfront',
|
2938 |
+
'shopping mall indoor',
|
2939 |
+
'shower',
|
2940 |
+
'skatepark',
|
2941 |
+
'ski lodge',
|
2942 |
+
'ski resort',
|
2943 |
+
'ski slope',
|
2944 |
+
'sky',
|
2945 |
+
'skyscraper',
|
2946 |
+
'slum',
|
2947 |
+
'snowfield',
|
2948 |
+
'squash court',
|
2949 |
+
'stable',
|
2950 |
+
'stadium baseball',
|
2951 |
+
'stadium football',
|
2952 |
+
'stage indoor',
|
2953 |
+
'staircase',
|
2954 |
+
'street',
|
2955 |
+
'subway interior',
|
2956 |
+
'subway station platform',
|
2957 |
+
'supermarket',
|
2958 |
+
'sushi bar',
|
2959 |
+
'swamp',
|
2960 |
+
'swimming pool indoor',
|
2961 |
+
'swimming pool outdoor',
|
2962 |
+
'synagogue indoor',
|
2963 |
+
'synagogue outdoor',
|
2964 |
+
'television studio',
|
2965 |
+
'temple east asia',
|
2966 |
+
'temple south asia',
|
2967 |
+
'tennis court indoor',
|
2968 |
+
'tennis court outdoor',
|
2969 |
+
'tent outdoor',
|
2970 |
+
'theater indoor procenium',
|
2971 |
+
'theater indoor seats',
|
2972 |
+
'thriftshop',
|
2973 |
+
'throne room',
|
2974 |
+
'ticket booth',
|
2975 |
+
'toll plaza',
|
2976 |
+
'topiary garden',
|
2977 |
+
'tower',
|
2978 |
+
'toyshop',
|
2979 |
+
'track outdoor',
|
2980 |
+
'train railway',
|
2981 |
+
'train station platform',
|
2982 |
+
'tree farm',
|
2983 |
+
'tree house',
|
2984 |
+
'trench',
|
2985 |
+
'underwater coral reef',
|
2986 |
+
'utility room',
|
2987 |
+
'valley',
|
2988 |
+
'van interior',
|
2989 |
+
'vegetable garden',
|
2990 |
+
'veranda',
|
2991 |
+
'veterinarians office',
|
2992 |
+
'viaduct',
|
2993 |
+
'videostore',
|
2994 |
+
'village',
|
2995 |
+
'vineyard',
|
2996 |
+
'volcano',
|
2997 |
+
'volleyball court indoor',
|
2998 |
+
'volleyball court outdoor',
|
2999 |
+
'waiting room',
|
3000 |
+
'warehouse indoor',
|
3001 |
+
'water tower',
|
3002 |
+
'waterfall block',
|
3003 |
+
'waterfall fan',
|
3004 |
+
'waterfall plunge',
|
3005 |
+
'watering hole',
|
3006 |
+
'wave',
|
3007 |
+
'wet bar',
|
3008 |
+
'wheat field',
|
3009 |
+
'wind farm',
|
3010 |
+
'windmill',
|
3011 |
+
'wine cellar barrel storage',
|
3012 |
+
'wine cellar bottle storage',
|
3013 |
+
'wrestling ring indoor',
|
3014 |
+
'yard',
|
3015 |
+
'youth hostel',
|
3016 |
+
]

templates = [
'a photo of a {}.',
'a photo of the {}.',
]
```

## StanfordCars
|
3027 |
+
|
3028 |
+
```bash
|
3029 |
+
classes = [
|
3030 |
+
'AM General Hummer SUV 2000',
|
3031 |
+
'Acura RL Sedan 2012',
|
3032 |
+
'Acura TL Sedan 2012',
|
3033 |
+
'Acura TL Type-S 2008',
|
3034 |
+
'Acura TSX Sedan 2012',
|
3035 |
+
'Acura Integra Type R 2001',
|
3036 |
+
'Acura ZDX Hatchback 2012',
|
3037 |
+
'Aston Martin V8 Vantage Convertible 2012',
|
3038 |
+
'Aston Martin V8 Vantage Coupe 2012',
|
3039 |
+
'Aston Martin Virage Convertible 2012',
|
3040 |
+
'Aston Martin Virage Coupe 2012',
|
3041 |
+
'Audi RS 4 Convertible 2008',
|
3042 |
+
'Audi A5 Coupe 2012',
|
3043 |
+
'Audi TTS Coupe 2012',
|
3044 |
+
'Audi R8 Coupe 2012',
|
3045 |
+
'Audi V8 Sedan 1994',
|
3046 |
+
'Audi 100 Sedan 1994',
|
3047 |
+
'Audi 100 Wagon 1994',
|
3048 |
+
'Audi TT Hatchback 2011',
|
3049 |
+
'Audi S6 Sedan 2011',
|
3050 |
+
'Audi S5 Convertible 2012',
|
3051 |
+
'Audi S5 Coupe 2012',
|
3052 |
+
'Audi S4 Sedan 2012',
|
3053 |
+
'Audi S4 Sedan 2007',
|
3054 |
+
'Audi TT RS Coupe 2012',
|
3055 |
+
'BMW ActiveHybrid 5 Sedan 2012',
|
3056 |
+
'BMW 1 Series Convertible 2012',
|
3057 |
+
'BMW 1 Series Coupe 2012',
|
3058 |
+
'BMW 3 Series Sedan 2012',
|
3059 |
+
'BMW 3 Series Wagon 2012',
|
3060 |
+
'BMW 6 Series Convertible 2007',
|
3061 |
+
'BMW X5 SUV 2007',
|
3062 |
+
'BMW X6 SUV 2012',
|
3063 |
+
'BMW M3 Coupe 2012',
|
3064 |
+
'BMW M5 Sedan 2010',
|
3065 |
+
'BMW M6 Convertible 2010',
|
3066 |
+
'BMW X3 SUV 2012',
|
3067 |
+
'BMW Z4 Convertible 2012',
|
3068 |
+
'Bentley Continental Supersports Conv. Convertible 2012',
|
3069 |
+
'Bentley Arnage Sedan 2009',
|
3070 |
+
'Bentley Mulsanne Sedan 2011',
|
3071 |
+
'Bentley Continental GT Coupe 2012',
|
3072 |
+
'Bentley Continental GT Coupe 2007',
|
3073 |
+
'Bentley Continental Flying Spur Sedan 2007',
|
3074 |
+
'Bugatti Veyron 16.4 Convertible 2009',
|
3075 |
+
'Bugatti Veyron 16.4 Coupe 2009',
|
3076 |
+
'Buick Regal GS 2012',
|
3077 |
+
'Buick Rainier SUV 2007',
|
3078 |
+
'Buick Verano Sedan 2012',
|
3079 |
+
'Buick Enclave SUV 2012',
|
3080 |
+
'Cadillac CTS-V Sedan 2012',
|
3081 |
+
'Cadillac SRX SUV 2012',
|
3082 |
+
'Cadillac Escalade EXT Crew Cab 2007',
|
3083 |
+
'Chevrolet Silverado 1500 Hybrid Crew Cab 2012',
|
3084 |
+
'Chevrolet Corvette Convertible 2012',
|
3085 |
+
'Chevrolet Corvette ZR1 2012',
|
3086 |
+
'Chevrolet Corvette Ron Fellows Edition Z06 2007',
|
3087 |
+
'Chevrolet Traverse SUV 2012',
|
3088 |
+
'Chevrolet Camaro Convertible 2012',
|
3089 |
+
'Chevrolet HHR SS 2010',
|
3090 |
+
'Chevrolet Impala Sedan 2007',
|
3091 |
+
'Chevrolet Tahoe Hybrid SUV 2012',
|
3092 |
+
'Chevrolet Sonic Sedan 2012',
|
3093 |
+
'Chevrolet Express Cargo Van 2007',
|
3094 |
+
'Chevrolet Avalanche Crew Cab 2012',
|
3095 |
+
'Chevrolet Cobalt SS 2010',
|
3096 |
+
'Chevrolet Malibu Hybrid Sedan 2010',
|
3097 |
+
'Chevrolet TrailBlazer SS 2009',
|
3098 |
+
'Chevrolet Silverado 2500HD Regular Cab 2012',
|
3099 |
+
'Chevrolet Silverado 1500 Classic Extended Cab 2007',
|
3100 |
+
'Chevrolet Express Van 2007',
|
3101 |
+
'Chevrolet Monte Carlo Coupe 2007',
|
3102 |
+
'Chevrolet Malibu Sedan 2007',
|
3103 |
+
'Chevrolet Silverado 1500 Extended Cab 2012',
|
3104 |
+
'Chevrolet Silverado 1500 Regular Cab 2012',
|
3105 |
+
'Chrysler Aspen SUV 2009',
|
3106 |
+
'Chrysler Sebring Convertible 2010',
|
3107 |
+
'Chrysler Town and Country Minivan 2012',
|
3108 |
+
'Chrysler 300 SRT-8 2010',
|
3109 |
+
'Chrysler Crossfire Convertible 2008',
|
3110 |
+
'Chrysler PT Cruiser Convertible 2008',
|
3111 |
+
'Daewoo Nubira Wagon 2002',
|
3112 |
+
'Dodge Caliber Wagon 2012',
|
3113 |
+
'Dodge Caliber Wagon 2007',
|
3114 |
+
'Dodge Caravan Minivan 1997',
|
3115 |
+
'Dodge Ram Pickup 3500 Crew Cab 2010',
|
3116 |
+
'Dodge Ram Pickup 3500 Quad Cab 2009',
|
3117 |
+
'Dodge Sprinter Cargo Van 2009',
|
3118 |
+
'Dodge Journey SUV 2012',
|
3119 |
+
'Dodge Dakota Crew Cab 2010',
|
3120 |
+
'Dodge Dakota Club Cab 2007',
|
3121 |
+
'Dodge Magnum Wagon 2008',
|
3122 |
+
'Dodge Challenger SRT8 2011',
|
3123 |
+
'Dodge Durango SUV 2012',
|
3124 |
+
'Dodge Durango SUV 2007',
|
3125 |
+
'Dodge Charger Sedan 2012',
|
3126 |
+
'Dodge Charger SRT-8 2009',
|
3127 |
+
'Eagle Talon Hatchback 1998',
|
3128 |
+
'FIAT 500 Abarth 2012',
|
3129 |
+
'FIAT 500 Convertible 2012',
|
3130 |
+
'Ferrari FF Coupe 2012',
|
3131 |
+
'Ferrari California Convertible 2012',
|
3132 |
+
'Ferrari 458 Italia Convertible 2012',
|
3133 |
+
'Ferrari 458 Italia Coupe 2012',
|
3134 |
+
'Fisker Karma Sedan 2012',
|
3135 |
+
'Ford F-450 Super Duty Crew Cab 2012',
|
3136 |
+
'Ford Mustang Convertible 2007',
|
3137 |
+
'Ford Freestar Minivan 2007',
|
3138 |
+
'Ford Expedition EL SUV 2009',
|
3139 |
+
'Ford Edge SUV 2012',
|
3140 |
+
'Ford Ranger SuperCab 2011',
|
3141 |
+
'Ford GT Coupe 2006',
|
3142 |
+
'Ford F-150 Regular Cab 2012',
|
3143 |
+
'Ford F-150 Regular Cab 2007',
|
3144 |
+
'Ford Focus Sedan 2007',
|
3145 |
+
'Ford E-Series Wagon Van 2012',
|
3146 |
+
'Ford Fiesta Sedan 2012',
|
3147 |
+
'GMC Terrain SUV 2012',
|
3148 |
+
'GMC Savana Van 2012',
|
3149 |
+
'GMC Yukon Hybrid SUV 2012',
|
3150 |
+
'GMC Acadia SUV 2012',
|
3151 |
+
'GMC Canyon Extended Cab 2012',
|
3152 |
+
'Geo Metro Convertible 1993',
|
3153 |
+
'HUMMER H3T Crew Cab 2010',
|
3154 |
+
'HUMMER H2 SUT Crew Cab 2009',
|
3155 |
+
'Honda Odyssey Minivan 2012',
|
3156 |
+
'Honda Odyssey Minivan 2007',
|
3157 |
+
'Honda Accord Coupe 2012',
|
3158 |
+
'Honda Accord Sedan 2012',
|
3159 |
+
'Hyundai Veloster Hatchback 2012',
|
3160 |
+
'Hyundai Santa Fe SUV 2012',
|
3161 |
+
'Hyundai Tucson SUV 2012',
|
3162 |
+
'Hyundai Veracruz SUV 2012',
|
3163 |
+
'Hyundai Sonata Hybrid Sedan 2012',
|
3164 |
+
'Hyundai Elantra Sedan 2007',
|
3165 |
+
'Hyundai Accent Sedan 2012',
|
3166 |
+
'Hyundai Genesis Sedan 2012',
|
3167 |
+
'Hyundai Sonata Sedan 2012',
|
3168 |
+
'Hyundai Elantra Touring Hatchback 2012',
|
3169 |
+
'Hyundai Azera Sedan 2012',
|
3170 |
+
'Infiniti G Coupe IPL 2012',
|
3171 |
+
'Infiniti QX56 SUV 2011',
|
3172 |
+
'Isuzu Ascender SUV 2008',
|
3173 |
+
'Jaguar XK XKR 2012',
|
3174 |
+
'Jeep Patriot SUV 2012',
|
3175 |
+
'Jeep Wrangler SUV 2012',
|
3176 |
+
'Jeep Liberty SUV 2012',
|
3177 |
+
'Jeep Grand Cherokee SUV 2012',
|
3178 |
+
'Jeep Compass SUV 2012',
|
3179 |
+
'Lamborghini Reventon Coupe 2008',
|
3180 |
+
'Lamborghini Aventador Coupe 2012',
|
3181 |
+
'Lamborghini Gallardo LP 570-4 Superleggera 2012',
|
3182 |
+
'Lamborghini Diablo Coupe 2001',
|
3183 |
+
'Land Rover Range Rover SUV 2012',
|
3184 |
+
'Land Rover LR2 SUV 2012',
|
3185 |
+
'Lincoln Town Car Sedan 2011',
|
3186 |
+
'MINI Cooper Roadster Convertible 2012',
|
3187 |
+
'Maybach Landaulet Convertible 2012',
|
3188 |
+
'Mazda Tribute SUV 2011',
|
3189 |
+
'McLaren MP4-12C Coupe 2012',
|
3190 |
+
'Mercedes-Benz 300-Class Convertible 1993',
|
3191 |
+
'Mercedes-Benz C-Class Sedan 2012',
|
3192 |
+
'Mercedes-Benz SL-Class Coupe 2009',
|
3193 |
+
'Mercedes-Benz E-Class Sedan 2012',
|
3194 |
+
'Mercedes-Benz S-Class Sedan 2012',
|
3195 |
+
'Mercedes-Benz Sprinter Van 2012',
|
3196 |
+
'Mitsubishi Lancer Sedan 2012',
|
3197 |
+
'Nissan Leaf Hatchback 2012',
|
3198 |
+
'Nissan NV Passenger Van 2012',
|
3199 |
+
'Nissan Juke Hatchback 2012',
|
3200 |
+
'Nissan 240SX Coupe 1998',
|
3201 |
+
'Plymouth Neon Coupe 1999',
|
3202 |
+
'Porsche Panamera Sedan 2012',
|
3203 |
+
'Ram C/V Cargo Van Minivan 2012',
|
3204 |
+
'Rolls-Royce Phantom Drophead Coupe Convertible 2012',
|
3205 |
+
'Rolls-Royce Ghost Sedan 2012',
|
3206 |
+
'Rolls-Royce Phantom Sedan 2012',
|
3207 |
+
'Scion xD Hatchback 2012',
|
3208 |
+
'Spyker C8 Convertible 2009',
|
3209 |
+
'Spyker C8 Coupe 2009',
|
3210 |
+
'Suzuki Aerio Sedan 2007',
|
3211 |
+
'Suzuki Kizashi Sedan 2012',
|
3212 |
+
'Suzuki SX4 Hatchback 2012',
|
3213 |
+
'Suzuki SX4 Sedan 2012',
|
3214 |
+
'Tesla Model S Sedan 2012',
|
3215 |
+
'Toyota Sequoia SUV 2012',
|
3216 |
+
'Toyota Camry Sedan 2012',
|
3217 |
+
'Toyota Corolla Sedan 2012',
|
3218 |
+
'Toyota 4Runner SUV 2012',
|
3219 |
+
'Volkswagen Golf Hatchback 2012',
|
3220 |
+
'Volkswagen Golf Hatchback 1991',
|
3221 |
+
'Volkswagen Beetle Hatchback 2012',
|
3222 |
+
'Volvo C30 Hatchback 2012',
|
3223 |
+
'Volvo 240 Sedan 1993',
|
3224 |
+
'Volvo XC90 SUV 2007',
|
3225 |
+
'smart fortwo Convertible 2012',
|
3226 |
+
]

templates = [
'a photo of a {}.',
'a photo of the {}.',
'a photo of my {}.',
'i love my {}!',
'a photo of my dirty {}.',
'a photo of my clean {}.',
'a photo of my new {}.',
'a photo of my old {}.',
]
```

## UCF101
|
3243 |
+
|
3244 |
+
```bash
|
3245 |
+
classes = [
|
3246 |
+
'Apply Eye Makeup',
|
3247 |
+
'Apply Lipstick',
|
3248 |
+
'Archery',
|
3249 |
+
'Baby Crawling',
|
3250 |
+
'Balance Beam',
|
3251 |
+
'Band Marching',
|
3252 |
+
'Baseball Pitch',
|
3253 |
+
'Basketball',
|
3254 |
+
'Basketball Dunk',
|
3255 |
+
'Bench Press',
|
3256 |
+
'Biking',
|
3257 |
+
'Billiards',
|
3258 |
+
'Blow Dry Hair',
|
3259 |
+
'Blowing Candles',
|
3260 |
+
'Body Weight Squats',
|
3261 |
+
'Bowling',
|
3262 |
+
'Boxing Punching Bag',
|
3263 |
+
'Boxing Speed Bag',
|
3264 |
+
'Breast Stroke',
|
3265 |
+
'Brushing Teeth',
|
3266 |
+
'Clean And Jerk',
|
3267 |
+
'Cliff Diving',
|
3268 |
+
'Cricket Bowling',
|
3269 |
+
'Cricket Shot',
|
3270 |
+
'Cutting In Kitchen',
|
3271 |
+
'Diving',
|
3272 |
+
'Drumming',
|
3273 |
+
'Fencing',
|
3274 |
+
'Field Hockey Penalty',
|
3275 |
+
'Floor Gymnastics',
|
3276 |
+
'Frisbee Catch',
|
3277 |
+
'Front Crawl',
|
3278 |
+
'Golf Swing',
|
3279 |
+
'Haircut',
|
3280 |
+
'Hammer Throw',
|
3281 |
+
'Hammering',
|
3282 |
+
'Hand Stand Pushups',
|
3283 |
+
'Handstand Walking',
|
3284 |
+
'Head Massage',
|
3285 |
+
'High Jump',
|
3286 |
+
'Horse Race',
|
3287 |
+
'Horse Riding',
|
3288 |
+
'Hula Hoop',
|
3289 |
+
'Ice Dancing',
|
3290 |
+
'Javelin Throw',
|
3291 |
+
'Juggling Balls',
|
3292 |
+
'Jump Rope',
|
3293 |
+
'Jumping Jack',
|
3294 |
+
'Kayaking',
|
3295 |
+
'Knitting',
|
3296 |
+
'Long Jump',
|
3297 |
+
'Lunges',
|
3298 |
+
'Military Parade',
|
3299 |
+
'Mixing',
|
3300 |
+
'Mopping Floor',
|
3301 |
+
'Nunchucks',
|
3302 |
+
'Parallel Bars',
|
3303 |
+
'Pizza Tossing',
|
3304 |
+
'Playing Cello',
|
3305 |
+
'Playing Daf',
|
3306 |
+
'Playing Dhol',
|
3307 |
+
'Playing Flute',
|
3308 |
+
'Playing Guitar',
|
3309 |
+
'Playing Piano',
|
3310 |
+
'Playing Sitar',
|
3311 |
+
'Playing Tabla',
|
3312 |
+
'Playing Violin',
|
3313 |
+
'Pole Vault',
|
3314 |
+
'Pommel Horse',
|
3315 |
+
'Pull Ups',
|
3316 |
+
'Punch',
|
3317 |
+
'Push Ups',
|
3318 |
+
'Rafting',
|
3319 |
+
'Rock Climbing Indoor',
|
3320 |
+
'Rope Climbing',
|
3321 |
+
'Rowing',
|
3322 |
+
'Salsa Spin',
|
3323 |
+
'Shaving Beard',
|
3324 |
+
'Shotput',
|
3325 |
+
'Skate Boarding',
|
3326 |
+
'Skiing',
|
3327 |
+
'Skijet',
|
3328 |
+
'Sky Diving',
|
3329 |
+
'Soccer Juggling',
|
3330 |
+
'Soccer Penalty',
|
3331 |
+
'Still Rings',
|
3332 |
+
'Sumo Wrestling',
|
3333 |
+
'Surfing',
|
3334 |
+
'Swing',
|
3335 |
+
'Table Tennis Shot',
|
3336 |
+
'Tai Chi',
|
3337 |
+
'Tennis Swing',
|
3338 |
+
'Throw Discus',
|
3339 |
+
'Trampoline Jumping',
|
3340 |
+
'Typing',
|
3341 |
+
'Uneven Bars',
|
3342 |
+
'Volleyball Spiking',
|
3343 |
+
'Walking With Dog',
|
3344 |
+
'Wall Pushups',
|
3345 |
+
'Writing On Board',
|
3346 |
+
'Yo Yo',
|
3347 |
+
]

templates = [
'a photo of a person {}.',
'a video of a person {}.',
'a example of a person {}.',
'a demonstration of a person {}.',
'a photo of the person {}.',
'a video of the person {}.',
'a example of the person {}.',
'a demonstration of the person {}.',
'a photo of a person using {}.',
'a video of a person using {}.',
'a example of a person using {}.',
'a demonstration of a person using {}.',
'a photo of the person using {}.',
'a video of the person using {}.',
'a example of the person using {}.',
'a demonstration of the person using {}.',
'a photo of a person doing {}.',
'a video of a person doing {}.',
'a example of a person doing {}.',
'a demonstration of a person doing {}.',
'a photo of the person doing {}.',
'a video of the person doing {}.',
'a example of the person doing {}.',
'a demonstration of the person doing {}.',
'a photo of a person during {}.',
'a video of a person during {}.',
'a example of a person during {}.',
'a demonstration of a person during {}.',
'a photo of the person during {}.',
'a video of the person during {}.',
'a example of the person during {}.',
'a demonstration of the person during {}.',
'a photo of a person performing {}.',
'a video of a person performing {}.',
'a example of a person performing {}.',
'a demonstration of a person performing {}.',
'a photo of the person performing {}.',
'a video of the person performing {}.',
'a example of the person performing {}.',
'a demonstration of the person performing {}.',
'a photo of a person practicing {}.',
'a video of a person practicing {}.',
'a example of a person practicing {}.',
'a demonstration of a person practicing {}.',
'a photo of the person practicing {}.',
'a video of the person practicing {}.',
'a example of the person practicing {}.',
'a demonstration of the person practicing {}.',
]
```

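For illustration, a minimal sketch of how a `classes` / `templates` pair like the ones above is typically turned into zero-shot classifier weights, assuming the `clip` package from this repository is installed; the class names, template strings, and model name below are placeholders only.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder lists; substitute any classes/templates pair listed above.
classes = ['meme', 'hatespeech meme']
templates = ['a {}.']

with torch.no_grad():
    weights = []
    for classname in classes:
        # Fill every template with the class name and embed the resulting prompts.
        texts = clip.tokenize([t.format(classname) for t in templates]).to(device)
        embeddings = model.encode_text(texts)
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        # Average the prompt ensemble into a single normalized class embedding.
        class_embedding = embeddings.mean(dim=0)
        class_embedding = class_embedding / class_embedding.norm()
        weights.append(class_embedding)
    zeroshot_weights = torch.stack(weights, dim=1)  # shape: (embed_dim, n_classes)

# Normalized image features multiplied by these weights give per-class logits:
# logits = 100.0 * image_features @ zeroshot_weights
```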
CLIP/data/rendered-sst2.md
ADDED
@@ -0,0 +1,11 @@
# The Rendered SST2 Dataset

In the paper, we used an image classification dataset called Rendered SST2 to evaluate the model's capability on optical character recognition. To do so, we rendered the sentences in the [Stanford Sentiment Treebank v2](https://nlp.stanford.edu/sentiment/treebank.html) dataset and used those as the input to the CLIP image encoder.

The following command will download a 131MB archive containing the images and extract it into a subdirectory `rendered-sst2`:

```bash
wget https://openaipublic.azureedge.net/clip/data/rendered-sst2.tgz
tar zxvf rendered-sst2.tgz
```
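As a rough sketch of how the extracted images might be loaded: the `test` split folder and `negative`/`positive` subfolders below are assumptions about the archive layout, so adjust the paths to whatever the extraction actually produces.

```python
from torchvision import datasets, transforms

# Assumed layout after extraction: rendered-sst2/<split>/<class>/*.png
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
test_set = datasets.ImageFolder("rendered-sst2/test", transform=preprocess)
print(test_set.classes)  # e.g. ['negative', 'positive']
```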
CLIP/data/yfcc100m.md
ADDED
@@ -0,0 +1,14 @@
# The YFCC100M Subset

In the paper, we performed a dataset ablation using a subset of the YFCC100M dataset and showed that the performance remained largely similar.

The subset contains 14,829,396 images, about 15% of the full dataset, which have been filtered to only keep those with natural language titles and/or descriptions in English.

We provide the list of (line number, photo identifier, photo hash) of each image contained in this subset. These correspond to the first three columns in the dataset's metadata TSV file.

```bash
wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
bunzip2 yfcc100m_subset_data.tsv.bz2
```

Use of the underlying media files is subject to the Creative Commons licenses chosen by their creators/uploaders. For more information about the YFCC100M dataset, visit [the official website](https://multimediacommons.wordpress.com/yfcc100m-core-dataset/).
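A small sketch of reading the decompressed list into memory; the column names are assumptions (the file itself is described above as having three unlabeled columns).

```python
import pandas as pd

# Three columns as described above: line number, photo identifier, photo hash.
subset = pd.read_csv(
    "yfcc100m_subset_data.tsv",
    sep="\t",
    header=None,
    names=["line_number", "photo_id", "photo_hash"],
)
print(len(subset))  # expected to be 14,829,396
photo_ids = set(subset["photo_id"])
```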
CLIP/hubconf.py
ADDED
@@ -0,0 +1,42 @@
from clip.clip import tokenize as _tokenize, load as _load, available_models as _available_models
import re
import string

dependencies = ["torch", "torchvision", "ftfy", "regex", "tqdm"]

# For compatibility (cannot include special characters in function name)
model_functions = { model: re.sub(f'[{string.punctuation}]', '_', model) for model in _available_models()}

def _create_hub_entrypoint(model):
    def entrypoint(**kwargs):
        return _load(model, **kwargs)

    entrypoint.__doc__ = f"""Loads the {model} CLIP model

        Parameters
        ----------
        device : Union[str, torch.device]
            The device to put the loaded model

        jit : bool
            Whether to load the optimized JIT model or more hackable non-JIT model (default).

        download_root: str
            path to download the model files; by default, it uses "~/.cache/clip"

        Returns
        -------
        model : torch.nn.Module
            The {model} CLIP model

        preprocess : Callable[[PIL.Image], torch.Tensor]
            A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
        """
    return entrypoint

def tokenize():
    return _tokenize

_entrypoints = {model_functions[model]: _create_hub_entrypoint(model) for model in _available_models()}

globals().update(_entrypoints)
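A hedged usage sketch for the entrypoints defined above, assuming the repository is reachable by `torch.hub` as `openai/CLIP`; punctuation in model names is replaced by underscores, so `ViT-B/32` becomes the `ViT_B_32` entrypoint.

```python
import torch

# Load a model entrypoint; keyword arguments are forwarded to clip.load
# (device, jit, download_root).
model, preprocess = torch.hub.load("openai/CLIP", "ViT_B_32", jit=False)

# The `tokenize` entrypoint returns CLIP's tokenizer function.
tokenize = torch.hub.load("openai/CLIP", "tokenize")
tokens = tokenize(["a diagram", "a dog", "a cat"])
```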
CLIP/model-card.md
ADDED
@@ -0,0 +1,120 @@
# Model Card: CLIP

Inspired by [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993) and [Lessons from Archives (Jo & Gebru)](https://arxiv.org/pdf/1912.10389.pdf), we’re providing some accompanying information about the multimodal model.

## Model Details

The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within.

### Model Date

January 2021

### Model Type

The base model uses a ResNet50 with several modifications as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.

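As an illustrative sketch of the two encoders and the image-text similarity they are trained on, using the `clip` package bundled in this repository (the image path and prompt strings below are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # ResNet50 image encoder + Transformer text encoder

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)  # placeholder image path
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # image encoder output
    text_features = model.encode_text(text)      # text encoder output
    # The forward pass returns scaled cosine similarities, the quantity the
    # contrastive loss maximizes for matching (image, text) pairs.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
```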
|
17 |
+
### Model Versions
|
18 |
+
|
19 |
+
Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
|
20 |
+
|
21 |
+
As part of the staged release process, we have also released the RN101 model, as well as RN50x4, a RN50 scaled up 4x according to the [EfficientNet](https://arxiv.org/abs/1905.11946) scaling rule. In July 2021, we additionally released the RN50x16 and ViT-B/16 models, and in January 2022, the RN50x64 and ViT-L/14 models were released. Lastly, the ViT-L/14@336px model was released in April 2022.
|
22 |
+
|
23 |
+
Please see the paper linked below for further details about their specification.
|
24 |
+
|
25 |
+
### Documents
|
26 |
+
|
27 |
+
- [Blog Post](https://openai.com/blog/clip/)
|
28 |
+
- [CLIP Paper](https://arxiv.org/abs/2103.00020)
|
29 |
+
|
30 |
+
|
31 |
+
|
32 |
+
## Model Use
|
33 |
+
|
34 |
+
### Intended Use
|
35 |
+
|
36 |
+
The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.
|
37 |
+
|
38 |
+
#### Primary intended uses
|
39 |
+
|
40 |
+
The primary intended users of these models are AI researchers.
|
41 |
+
|
42 |
+
We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
|
43 |
+
|
44 |
+
### Out-of-Scope Use Cases
|
45 |
+
|
46 |
+
**Any** deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP’s performance with different class taxonomies. As a result, untested and unconstrained deployment of the model in any use case is currently potentially harmful.
|
47 |
+
|
48 |
+
Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.
|
49 |
+
|
50 |
+
Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.
|
51 |
+
|
52 |
+
|
53 |
+
|
54 |
+
## Data
|
55 |
+
|
56 |
+
The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet, which tend to skew towards more developed nations and younger, male users.
|
57 |
+
|
58 |
+
### Data Mission Statement
|
59 |
+
|
60 |
+
Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset.
|
61 |
+
|
62 |
+
|
63 |
+
|
64 |
+
## Performance and Limitations
|
65 |
+
|
66 |
+
### Performance
|
67 |
+
|
68 |
+
We have evaluated the performance of CLIP on a wide range of benchmarks across a variety of computer vision tasks, ranging from OCR to texture recognition to fine-grained classification. The paper describes model performance on the following datasets:
|
69 |
+
|
70 |
+
- Food101
|
71 |
+
- CIFAR10
|
72 |
+
- CIFAR100
|
73 |
+
- Birdsnap
|
74 |
+
- SUN397
|
75 |
+
- Stanford Cars
|
76 |
+
- FGVC Aircraft
|
77 |
+
- VOC2007
|
78 |
+
- DTD
|
79 |
+
- Oxford-IIIT Pet dataset
|
80 |
+
- Caltech101
|
81 |
+
- Flowers102
|
82 |
+
- MNIST
|
83 |
+
- SVHN
|
84 |
+
- IIIT5K
|
85 |
+
- Hateful Memes
|
86 |
+
- SST-2
|
87 |
+
- UCF101
|
88 |
+
- Kinetics700
|
89 |
+
- Country211
|
90 |
+
- CLEVR Counting
|
91 |
+
- KITTI Distance
|
92 |
+
- STL-10
|
93 |
+
- RareAct
|
94 |
+
- Flickr30
|
95 |
+
- MSCOCO
|
96 |
+
- ImageNet
|
97 |
+
- ImageNet-A
|
98 |
+
- ImageNet-R
|
99 |
+
- ImageNet Sketch
|
100 |
+
- ObjectNet (ImageNet Overlap)
|
101 |
+
- Youtube-BB
|
102 |
+
- ImageNet-Vid
|
103 |
+
|
104 |
+
## Limitations
|
105 |
+
|
106 |
+
CLIP and our analysis of it have a number of limitations. CLIP currently struggles with certain tasks such as fine-grained classification and counting objects. CLIP also poses issues with regard to fairness and bias, which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP has an important limitation: in many cases we have used linear probes to evaluate the performance of CLIP, and there is evidence suggesting that linear probes can underestimate model performance.
|
107 |
+
|
108 |
+
### Bias and Fairness
|
109 |
+
|
110 |
+
We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from [Fairface](https://arxiv.org/abs/1908.04913) into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details captured in the Broader Impacts Section in the paper).
|
111 |
+
|
112 |
+
We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (we default to the race categories as they are constructed in the Fairface dataset) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification, with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks, not to demonstrate endorsement of or enthusiasm for such tasks.
|
113 |
+
|
114 |
+
|
115 |
+
|
116 |
+
## Feedback
|
117 |
+
|
118 |
+
### Where to send questions or comments about the model
|
119 |
+
|
120 |
+
Please use [this Google Form](https://forms.gle/Uv7afRH5dvY34ZEs9)
|
CLIP/notebooks/Interacting_with_CLIP.ipynb
ADDED
The diff for this file is too large to render.
See raw diff
|
|
CLIP/notebooks/Prompt_Engineering_for_ImageNet.ipynb
ADDED
@@ -0,0 +1,1107 @@
1 |
+
{
|
2 |
+
"nbformat": 4,
|
3 |
+
"nbformat_minor": 0,
|
4 |
+
"metadata": {
|
5 |
+
"colab": {
|
6 |
+
"name": "Prompt Engineering for ImageNet.ipynb",
|
7 |
+
"provenance": [],
|
8 |
+
"collapsed_sections": []
|
9 |
+
},
|
10 |
+
"kernelspec": {
|
11 |
+
"name": "python3",
|
12 |
+
"display_name": "Python 3"
|
13 |
+
},
|
14 |
+
"accelerator": "GPU",
|
15 |
+
"widgets": {
|
16 |
+
"application/vnd.jupyter.widget-state+json": {
|
17 |
+
"66a1639713ae441d8a9b873381f9d774": {
|
18 |
+
"model_module": "@jupyter-widgets/controls",
|
19 |
+
"model_name": "HBoxModel",
|
20 |
+
"state": {
|
21 |
+
"_view_name": "HBoxView",
|
22 |
+
"_dom_classes": [],
|
23 |
+
"_model_name": "HBoxModel",
|
24 |
+
"_view_module": "@jupyter-widgets/controls",
|
25 |
+
"_model_module_version": "1.5.0",
|
26 |
+
"_view_count": null,
|
27 |
+
"_view_module_version": "1.5.0",
|
28 |
+
"box_style": "",
|
29 |
+
"layout": "IPY_MODEL_610b775178c645e2b4663b77cc0c67b6",
|
30 |
+
"_model_module": "@jupyter-widgets/controls",
|
31 |
+
"children": [
|
32 |
+
"IPY_MODEL_412dd15f0d8542f5ab2730f8616fb582",
|
33 |
+
"IPY_MODEL_5e6315f36b4e4eeea5c6294b024e0c97"
|
34 |
+
]
|
35 |
+
}
|
36 |
+
},
|
37 |
+
"610b775178c645e2b4663b77cc0c67b6": {
|
38 |
+
"model_module": "@jupyter-widgets/base",
|
39 |
+
"model_name": "LayoutModel",
|
40 |
+
"state": {
|
41 |
+
"_view_name": "LayoutView",
|
42 |
+
"grid_template_rows": null,
|
43 |
+
"right": null,
|
44 |
+
"justify_content": null,
|
45 |
+
"_view_module": "@jupyter-widgets/base",
|
46 |
+
"overflow": null,
|
47 |
+
"_model_module_version": "1.2.0",
|
48 |
+
"_view_count": null,
|
49 |
+
"flex_flow": null,
|
50 |
+
"width": null,
|
51 |
+
"min_width": null,
|
52 |
+
"border": null,
|
53 |
+
"align_items": null,
|
54 |
+
"bottom": null,
|
55 |
+
"_model_module": "@jupyter-widgets/base",
|
56 |
+
"top": null,
|
57 |
+
"grid_column": null,
|
58 |
+
"overflow_y": null,
|
59 |
+
"overflow_x": null,
|
60 |
+
"grid_auto_flow": null,
|
61 |
+
"grid_area": null,
|
62 |
+
"grid_template_columns": null,
|
63 |
+
"flex": null,
|
64 |
+
"_model_name": "LayoutModel",
|
65 |
+
"justify_items": null,
|
66 |
+
"grid_row": null,
|
67 |
+
"max_height": null,
|
68 |
+
"align_content": null,
|
69 |
+
"visibility": null,
|
70 |
+
"align_self": null,
|
71 |
+
"height": null,
|
72 |
+
"min_height": null,
|
73 |
+
"padding": null,
|
74 |
+
"grid_auto_rows": null,
|
75 |
+
"grid_gap": null,
|
76 |
+
"max_width": null,
|
77 |
+
"order": null,
|
78 |
+
"_view_module_version": "1.2.0",
|
79 |
+
"grid_template_areas": null,
|
80 |
+
"object_position": null,
|
81 |
+
"object_fit": null,
|
82 |
+
"grid_auto_columns": null,
|
83 |
+
"margin": null,
|
84 |
+
"display": null,
|
85 |
+
"left": null
|
86 |
+
}
|
87 |
+
},
|
88 |
+
"412dd15f0d8542f5ab2730f8616fb582": {
|
89 |
+
"model_module": "@jupyter-widgets/controls",
|
90 |
+
"model_name": "FloatProgressModel",
|
91 |
+
"state": {
|
92 |
+
"_view_name": "ProgressView",
|
93 |
+
"style": "IPY_MODEL_085d5388abda4202bfa66d0c088452f8",
|
94 |
+
"_dom_classes": [],
|
95 |
+
"description": "100%",
|
96 |
+
"_model_name": "FloatProgressModel",
|
97 |
+
"bar_style": "success",
|
98 |
+
"max": 1000,
|
99 |
+
"_view_module": "@jupyter-widgets/controls",
|
100 |
+
"_model_module_version": "1.5.0",
|
101 |
+
"value": 1000,
|
102 |
+
"_view_count": null,
|
103 |
+
"_view_module_version": "1.5.0",
|
104 |
+
"orientation": "horizontal",
|
105 |
+
"min": 0,
|
106 |
+
"description_tooltip": null,
|
107 |
+
"_model_module": "@jupyter-widgets/controls",
|
108 |
+
"layout": "IPY_MODEL_f75124b64aa147c693c67a78f8e3a231"
|
109 |
+
}
|
110 |
+
},
|
111 |
+
"5e6315f36b4e4eeea5c6294b024e0c97": {
|
112 |
+
"model_module": "@jupyter-widgets/controls",
|
113 |
+
"model_name": "HTMLModel",
|
114 |
+
"state": {
|
115 |
+
"_view_name": "HTMLView",
|
116 |
+
"style": "IPY_MODEL_6e5676a054874243b55fc6d120a07d01",
|
117 |
+
"_dom_classes": [],
|
118 |
+
"description": "",
|
119 |
+
"_model_name": "HTMLModel",
|
120 |
+
"placeholder": "",
|
121 |
+
"_view_module": "@jupyter-widgets/controls",
|
122 |
+
"_model_module_version": "1.5.0",
|
123 |
+
"value": " 1000/1000 [16:51<00:00, 1.01s/it]",
|
124 |
+
"_view_count": null,
|
125 |
+
"_view_module_version": "1.5.0",
|
126 |
+
"description_tooltip": null,
|
127 |
+
"_model_module": "@jupyter-widgets/controls",
|
128 |
+
"layout": "IPY_MODEL_dc6d1416c01a4047935ee15c3fd2eb1c"
|
129 |
+
}
|
130 |
+
},
|
131 |
+
"085d5388abda4202bfa66d0c088452f8": {
|
132 |
+
"model_module": "@jupyter-widgets/controls",
|
133 |
+
"model_name": "ProgressStyleModel",
|
134 |
+
"state": {
|
135 |
+
"_view_name": "StyleView",
|
136 |
+
"_model_name": "ProgressStyleModel",
|
137 |
+
"description_width": "initial",
|
138 |
+
"_view_module": "@jupyter-widgets/base",
|
139 |
+
"_model_module_version": "1.5.0",
|
140 |
+
"_view_count": null,
|
141 |
+
"_view_module_version": "1.2.0",
|
142 |
+
"bar_color": null,
|
143 |
+
"_model_module": "@jupyter-widgets/controls"
|
144 |
+
}
|
145 |
+
},
|
146 |
+
"f75124b64aa147c693c67a78f8e3a231": {
|
147 |
+
"model_module": "@jupyter-widgets/base",
|
148 |
+
"model_name": "LayoutModel",
|
149 |
+
"state": {
|
150 |
+
"_view_name": "LayoutView",
|
151 |
+
"grid_template_rows": null,
|
152 |
+
"right": null,
|
153 |
+
"justify_content": null,
|
154 |
+
"_view_module": "@jupyter-widgets/base",
|
155 |
+
"overflow": null,
|
156 |
+
"_model_module_version": "1.2.0",
|
157 |
+
"_view_count": null,
|
158 |
+
"flex_flow": null,
|
159 |
+
"width": null,
|
160 |
+
"min_width": null,
|
161 |
+
"border": null,
|
162 |
+
"align_items": null,
|
163 |
+
"bottom": null,
|
164 |
+
"_model_module": "@jupyter-widgets/base",
|
165 |
+
"top": null,
|
166 |
+
"grid_column": null,
|
167 |
+
"overflow_y": null,
|
168 |
+
"overflow_x": null,
|
169 |
+
"grid_auto_flow": null,
|
170 |
+
"grid_area": null,
|
171 |
+
"grid_template_columns": null,
|
172 |
+
"flex": null,
|
173 |
+
"_model_name": "LayoutModel",
|
174 |
+
"justify_items": null,
|
175 |
+
"grid_row": null,
|
176 |
+
"max_height": null,
|
177 |
+
"align_content": null,
|
178 |
+
"visibility": null,
|
179 |
+
"align_self": null,
|
180 |
+
"height": null,
|
181 |
+
"min_height": null,
|
182 |
+
"padding": null,
|
183 |
+
"grid_auto_rows": null,
|
184 |
+
"grid_gap": null,
|
185 |
+
"max_width": null,
|
186 |
+
"order": null,
|
187 |
+
"_view_module_version": "1.2.0",
|
188 |
+
"grid_template_areas": null,
|
189 |
+
"object_position": null,
|
190 |
+
"object_fit": null,
|
191 |
+
"grid_auto_columns": null,
|
192 |
+
"margin": null,
|
193 |
+
"display": null,
|
194 |
+
"left": null
|
195 |
+
}
|
196 |
+
},
|
197 |
+
"6e5676a054874243b55fc6d120a07d01": {
|
198 |
+
"model_module": "@jupyter-widgets/controls",
|
199 |
+
"model_name": "DescriptionStyleModel",
|
200 |
+
"state": {
|
201 |
+
"_view_name": "StyleView",
|
202 |
+
"_model_name": "DescriptionStyleModel",
|
203 |
+
"description_width": "",
|
204 |
+
"_view_module": "@jupyter-widgets/base",
|
205 |
+
"_model_module_version": "1.5.0",
|
206 |
+
"_view_count": null,
|
207 |
+
"_view_module_version": "1.2.0",
|
208 |
+
"_model_module": "@jupyter-widgets/controls"
|
209 |
+
}
|
210 |
+
},
|
211 |
+
"dc6d1416c01a4047935ee15c3fd2eb1c": {
|
212 |
+
"model_module": "@jupyter-widgets/base",
|
213 |
+
"model_name": "LayoutModel",
|
214 |
+
"state": {
|
215 |
+
"_view_name": "LayoutView",
|
216 |
+
"grid_template_rows": null,
|
217 |
+
"right": null,
|
218 |
+
"justify_content": null,
|
219 |
+
"_view_module": "@jupyter-widgets/base",
|
220 |
+
"overflow": null,
|
221 |
+
"_model_module_version": "1.2.0",
|
222 |
+
"_view_count": null,
|
223 |
+
"flex_flow": null,
|
224 |
+
"width": null,
|
225 |
+
"min_width": null,
|
226 |
+
"border": null,
|
227 |
+
"align_items": null,
|
228 |
+
"bottom": null,
|
229 |
+
"_model_module": "@jupyter-widgets/base",
|
230 |
+
"top": null,
|
231 |
+
"grid_column": null,
|
232 |
+
"overflow_y": null,
|
233 |
+
"overflow_x": null,
|
234 |
+
"grid_auto_flow": null,
|
235 |
+
"grid_area": null,
|
236 |
+
"grid_template_columns": null,
|
237 |
+
"flex": null,
|
238 |
+
"_model_name": "LayoutModel",
|
239 |
+
"justify_items": null,
|
240 |
+
"grid_row": null,
|
241 |
+
"max_height": null,
|
242 |
+
"align_content": null,
|
243 |
+
"visibility": null,
|
244 |
+
"align_self": null,
|
245 |
+
"height": null,
|
246 |
+
"min_height": null,
|
247 |
+
"padding": null,
|
248 |
+
"grid_auto_rows": null,
|
249 |
+
"grid_gap": null,
|
250 |
+
"max_width": null,
|
251 |
+
"order": null,
|
252 |
+
"_view_module_version": "1.2.0",
|
253 |
+
"grid_template_areas": null,
|
254 |
+
"object_position": null,
|
255 |
+
"object_fit": null,
|
256 |
+
"grid_auto_columns": null,
|
257 |
+
"margin": null,
|
258 |
+
"display": null,
|
259 |
+
"left": null
|
260 |
+
}
|
261 |
+
},
|
262 |
+
"84f80a7f3e764346969a347b0f71b24e": {
|
263 |
+
"model_module": "@jupyter-widgets/controls",
|
264 |
+
"model_name": "HBoxModel",
|
265 |
+
"state": {
|
266 |
+
"_view_name": "HBoxView",
|
267 |
+
"_dom_classes": [],
|
268 |
+
"_model_name": "HBoxModel",
|
269 |
+
"_view_module": "@jupyter-widgets/controls",
|
270 |
+
"_model_module_version": "1.5.0",
|
271 |
+
"_view_count": null,
|
272 |
+
"_view_module_version": "1.5.0",
|
273 |
+
"box_style": "",
|
274 |
+
"layout": "IPY_MODEL_392656f01b2945f3bd7903783ed8cc96",
|
275 |
+
"_model_module": "@jupyter-widgets/controls",
|
276 |
+
"children": [
|
277 |
+
"IPY_MODEL_8e47a435519b4ce090879b4be2f61f99",
|
278 |
+
"IPY_MODEL_41b1ed6b0a9745c1a595377670b15ff4"
|
279 |
+
]
|
280 |
+
}
|
281 |
+
},
|
282 |
+
"392656f01b2945f3bd7903783ed8cc96": {
|
283 |
+
"model_module": "@jupyter-widgets/base",
|
284 |
+
"model_name": "LayoutModel",
|
285 |
+
"state": {
|
286 |
+
"_view_name": "LayoutView",
|
287 |
+
"grid_template_rows": null,
|
288 |
+
"right": null,
|
289 |
+
"justify_content": null,
|
290 |
+
"_view_module": "@jupyter-widgets/base",
|
291 |
+
"overflow": null,
|
292 |
+
"_model_module_version": "1.2.0",
|
293 |
+
"_view_count": null,
|
294 |
+
"flex_flow": null,
|
295 |
+
"width": null,
|
296 |
+
"min_width": null,
|
297 |
+
"border": null,
|
298 |
+
"align_items": null,
|
299 |
+
"bottom": null,
|
300 |
+
"_model_module": "@jupyter-widgets/base",
|
301 |
+
"top": null,
|
302 |
+
"grid_column": null,
|
303 |
+
"overflow_y": null,
|
304 |
+
"overflow_x": null,
|
305 |
+
"grid_auto_flow": null,
|
306 |
+
"grid_area": null,
|
307 |
+
"grid_template_columns": null,
|
308 |
+
"flex": null,
|
309 |
+
"_model_name": "LayoutModel",
|
310 |
+
"justify_items": null,
|
311 |
+
"grid_row": null,
|
312 |
+
"max_height": null,
|
313 |
+
"align_content": null,
|
314 |
+
"visibility": null,
|
315 |
+
"align_self": null,
|
316 |
+
"height": null,
|
317 |
+
"min_height": null,
|
318 |
+
"padding": null,
|
319 |
+
"grid_auto_rows": null,
|
320 |
+
"grid_gap": null,
|
321 |
+
"max_width": null,
|
322 |
+
"order": null,
|
323 |
+
"_view_module_version": "1.2.0",
|
324 |
+
"grid_template_areas": null,
|
325 |
+
"object_position": null,
|
326 |
+
"object_fit": null,
|
327 |
+
"grid_auto_columns": null,
|
328 |
+
"margin": null,
|
329 |
+
"display": null,
|
330 |
+
"left": null
|
331 |
+
}
|
332 |
+
},
|
333 |
+
"8e47a435519b4ce090879b4be2f61f99": {
|
334 |
+
"model_module": "@jupyter-widgets/controls",
|
335 |
+
"model_name": "FloatProgressModel",
|
336 |
+
"state": {
|
337 |
+
"_view_name": "ProgressView",
|
338 |
+
"style": "IPY_MODEL_179b8ae1eb7f4a828f953e889b141725",
|
339 |
+
"_dom_classes": [],
|
340 |
+
"description": "100%",
|
341 |
+
"_model_name": "FloatProgressModel",
|
342 |
+
"bar_style": "success",
|
343 |
+
"max": 313,
|
344 |
+
"_view_module": "@jupyter-widgets/controls",
|
345 |
+
"_model_module_version": "1.5.0",
|
346 |
+
"value": 313,
|
347 |
+
"_view_count": null,
|
348 |
+
"_view_module_version": "1.5.0",
|
349 |
+
"orientation": "horizontal",
|
350 |
+
"min": 0,
|
351 |
+
"description_tooltip": null,
|
352 |
+
"_model_module": "@jupyter-widgets/controls",
|
353 |
+
"layout": "IPY_MODEL_d8708e8414fd44f4abd6590c9b57996f"
|
354 |
+
}
|
355 |
+
},
|
356 |
+
"41b1ed6b0a9745c1a595377670b15ff4": {
|
357 |
+
"model_module": "@jupyter-widgets/controls",
|
358 |
+
"model_name": "HTMLModel",
|
359 |
+
"state": {
|
360 |
+
"_view_name": "HTMLView",
|
361 |
+
"style": "IPY_MODEL_800e30f5b4f24475a2b0046da0703631",
|
362 |
+
"_dom_classes": [],
|
363 |
+
"description": "",
|
364 |
+
"_model_name": "HTMLModel",
|
365 |
+
"placeholder": "",
|
366 |
+
"_view_module": "@jupyter-widgets/controls",
|
367 |
+
"_model_module_version": "1.5.0",
|
368 |
+
"value": " 313/313 [02:31<00:00, 2.07it/s]",
|
369 |
+
"_view_count": null,
|
370 |
+
"_view_module_version": "1.5.0",
|
371 |
+
"description_tooltip": null,
|
372 |
+
"_model_module": "@jupyter-widgets/controls",
|
373 |
+
"layout": "IPY_MODEL_8764308b948745f1a677332fd21fcaf0"
|
374 |
+
}
|
375 |
+
},
|
376 |
+
"179b8ae1eb7f4a828f953e889b141725": {
|
377 |
+
"model_module": "@jupyter-widgets/controls",
|
378 |
+
"model_name": "ProgressStyleModel",
|
379 |
+
"state": {
|
380 |
+
"_view_name": "StyleView",
|
381 |
+
"_model_name": "ProgressStyleModel",
|
382 |
+
"description_width": "initial",
|
383 |
+
"_view_module": "@jupyter-widgets/base",
|
384 |
+
"_model_module_version": "1.5.0",
|
385 |
+
"_view_count": null,
|
386 |
+
"_view_module_version": "1.2.0",
|
387 |
+
"bar_color": null,
|
388 |
+
"_model_module": "@jupyter-widgets/controls"
|
389 |
+
}
|
390 |
+
},
|
391 |
+
"d8708e8414fd44f4abd6590c9b57996f": {
|
392 |
+
"model_module": "@jupyter-widgets/base",
|
393 |
+
"model_name": "LayoutModel",
|
394 |
+
"state": {
|
395 |
+
"_view_name": "LayoutView",
|
396 |
+
"grid_template_rows": null,
|
397 |
+
"right": null,
|
398 |
+
"justify_content": null,
|
399 |
+
"_view_module": "@jupyter-widgets/base",
|
400 |
+
"overflow": null,
|
401 |
+
"_model_module_version": "1.2.0",
|
402 |
+
"_view_count": null,
|
403 |
+
"flex_flow": null,
|
404 |
+
"width": null,
|
405 |
+
"min_width": null,
|
406 |
+
"border": null,
|
407 |
+
"align_items": null,
|
408 |
+
"bottom": null,
|
409 |
+
"_model_module": "@jupyter-widgets/base",
|
410 |
+
"top": null,
|
411 |
+
"grid_column": null,
|
412 |
+
"overflow_y": null,
|
413 |
+
"overflow_x": null,
|
414 |
+
"grid_auto_flow": null,
|
415 |
+
"grid_area": null,
|
416 |
+
"grid_template_columns": null,
|
417 |
+
"flex": null,
|
418 |
+
"_model_name": "LayoutModel",
|
419 |
+
"justify_items": null,
|
420 |
+
"grid_row": null,
|
421 |
+
"max_height": null,
|
422 |
+
"align_content": null,
|
423 |
+
"visibility": null,
|
424 |
+
"align_self": null,
|
425 |
+
"height": null,
|
426 |
+
"min_height": null,
|
427 |
+
"padding": null,
|
428 |
+
"grid_auto_rows": null,
|
429 |
+
"grid_gap": null,
|
430 |
+
"max_width": null,
|
431 |
+
"order": null,
|
432 |
+
"_view_module_version": "1.2.0",
|
433 |
+
"grid_template_areas": null,
|
434 |
+
"object_position": null,
|
435 |
+
"object_fit": null,
|
436 |
+
"grid_auto_columns": null,
|
437 |
+
"margin": null,
|
438 |
+
"display": null,
|
439 |
+
"left": null
|
440 |
+
}
|
441 |
+
},
|
442 |
+
"800e30f5b4f24475a2b0046da0703631": {
|
443 |
+
"model_module": "@jupyter-widgets/controls",
|
444 |
+
"model_name": "DescriptionStyleModel",
|
445 |
+
"state": {
|
446 |
+
"_view_name": "StyleView",
|
447 |
+
"_model_name": "DescriptionStyleModel",
|
448 |
+
"description_width": "",
|
449 |
+
"_view_module": "@jupyter-widgets/base",
|
450 |
+
"_model_module_version": "1.5.0",
|
451 |
+
"_view_count": null,
|
452 |
+
"_view_module_version": "1.2.0",
|
453 |
+
"_model_module": "@jupyter-widgets/controls"
|
454 |
+
}
|
455 |
+
},
|
456 |
+
"8764308b948745f1a677332fd21fcaf0": {
|
457 |
+
"model_module": "@jupyter-widgets/base",
|
458 |
+
"model_name": "LayoutModel",
|
459 |
+
"state": {
|
460 |
+
"_view_name": "LayoutView",
|
461 |
+
"grid_template_rows": null,
|
462 |
+
"right": null,
|
463 |
+
"justify_content": null,
|
464 |
+
"_view_module": "@jupyter-widgets/base",
|
465 |
+
"overflow": null,
|
466 |
+
"_model_module_version": "1.2.0",
|
467 |
+
"_view_count": null,
|
468 |
+
"flex_flow": null,
|
469 |
+
"width": null,
|
470 |
+
"min_width": null,
|
471 |
+
"border": null,
|
472 |
+
"align_items": null,
|
473 |
+
"bottom": null,
|
474 |
+
"_model_module": "@jupyter-widgets/base",
|
475 |
+
"top": null,
|
476 |
+
"grid_column": null,
|
477 |
+
"overflow_y": null,
|
478 |
+
"overflow_x": null,
|
479 |
+
"grid_auto_flow": null,
|
480 |
+
"grid_area": null,
|
481 |
+
"grid_template_columns": null,
|
482 |
+
"flex": null,
|
483 |
+
"_model_name": "LayoutModel",
|
484 |
+
"justify_items": null,
|
485 |
+
"grid_row": null,
|
486 |
+
"max_height": null,
|
487 |
+
"align_content": null,
|
488 |
+
"visibility": null,
|
489 |
+
"align_self": null,
|
490 |
+
"height": null,
|
491 |
+
"min_height": null,
|
492 |
+
"padding": null,
|
493 |
+
"grid_auto_rows": null,
|
494 |
+
"grid_gap": null,
|
495 |
+
"max_width": null,
|
496 |
+
"order": null,
|
497 |
+
"_view_module_version": "1.2.0",
|
498 |
+
"grid_template_areas": null,
|
499 |
+
"object_position": null,
|
500 |
+
"object_fit": null,
|
501 |
+
"grid_auto_columns": null,
|
502 |
+
"margin": null,
|
503 |
+
"display": null,
|
504 |
+
"left": null
|
505 |
+
}
|
506 |
+
}
|
507 |
+
}
|
508 |
+
}
|
509 |
+
},
|
510 |
+
"cells": [
|
511 |
+
{
|
512 |
+
"cell_type": "markdown",
|
513 |
+
"metadata": {
|
514 |
+
"id": "53N4k0pj_9qL"
|
515 |
+
},
|
516 |
+
"source": [
|
517 |
+
"# Preparation for Colab\n",
|
518 |
+
"\n",
|
519 |
+
"Make sure you're running a GPU runtime; if not, select \"GPU\" as the hardware accelerator in Runtime > Change Runtime Type in the menu. The next cells will install the `clip` package and its dependencies, and check if PyTorch 1.7.1 or later is installed."
|
520 |
+
]
|
521 |
+
},
|
522 |
+
{
|
523 |
+
"cell_type": "code",
|
524 |
+
"metadata": {
|
525 |
+
"colab": {
|
526 |
+
"base_uri": "https://localhost:8080/"
|
527 |
+
},
|
528 |
+
"id": "0BpdJkdBssk9",
|
529 |
+
"outputId": "41a4070f-5321-4fc4-bd4d-0b5c1f476d56"
|
530 |
+
},
|
531 |
+
"source": [
|
532 |
+
"! pip install ftfy regex tqdm\n",
|
533 |
+
"! pip install git+https://github.com/openai/CLIP.git"
|
534 |
+
],
|
535 |
+
"execution_count": 1,
|
536 |
+
"outputs": [
|
537 |
+
{
|
538 |
+
"output_type": "stream",
|
539 |
+
"text": [
|
540 |
+
"Collecting ftfy\n",
|
541 |
+
" Downloading ftfy-6.0.3.tar.gz (64 kB)\n",
|
542 |
+
"\u001b[?25l\r\u001b[K |█████ | 10 kB 14.9 MB/s eta 0:00:01\r\u001b[K |██████████▏ | 20 kB 18.7 MB/s eta 0:00:01\r\u001b[K |███████████████▎ | 30 kB 9.0 MB/s eta 0:00:01\r\u001b[K |████████████████████▍ | 40 kB 4.1 MB/s eta 0:00:01\r\u001b[K |█████████████████████████▌ | 51 kB 4.6 MB/s eta 0:00:01\r\u001b[K |██████████████████████████████▋ | 61 kB 4.7 MB/s eta 0:00:01\r\u001b[K |████████████████████████████████| 64 kB 1.3 MB/s \n",
|
543 |
+
"\u001b[?25hRequirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (2019.12.20)\n",
|
544 |
+
"Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (4.41.1)\n",
|
545 |
+
"Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from ftfy) (0.2.5)\n",
|
546 |
+
"Building wheels for collected packages: ftfy\n",
|
547 |
+
" Building wheel for ftfy (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
|
548 |
+
" Created wheel for ftfy: filename=ftfy-6.0.3-py3-none-any.whl size=41934 sha256=90ec193331444b2c4ff1cd81935e7de42065b89d304db7efac67bcfd87c27873\n",
|
549 |
+
" Stored in directory: /root/.cache/pip/wheels/19/f5/38/273eb3b5e76dfd850619312f693716ac4518b498f5ffb6f56d\n",
|
550 |
+
"Successfully built ftfy\n",
|
551 |
+
"Installing collected packages: ftfy\n",
|
552 |
+
"Successfully installed ftfy-6.0.3\n",
|
553 |
+
"Collecting git+https://github.com/openai/CLIP.git\n",
|
554 |
+
" Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-hqnbveqi\n",
|
555 |
+
" Running command git clone -q https://github.com/openai/CLIP.git /tmp/pip-req-build-hqnbveqi\n",
|
556 |
+
"Requirement already satisfied: ftfy in /usr/local/lib/python3.7/dist-packages (from clip==1.0) (6.0.3)\n",
|
557 |
+
"Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (from clip==1.0) (2019.12.20)\n",
|
558 |
+
"Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from clip==1.0) (4.41.1)\n",
|
559 |
+
"Requirement already satisfied: torch in /usr/local/lib/python3.7/dist-packages (from clip==1.0) (1.9.0+cu102)\n",
|
560 |
+
"Requirement already satisfied: torchvision in /usr/local/lib/python3.7/dist-packages (from clip==1.0) (0.10.0+cu102)\n",
|
561 |
+
"Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from ftfy->clip==1.0) (0.2.5)\n",
|
562 |
+
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch->clip==1.0) (3.7.4.3)\n",
|
563 |
+
"Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from torchvision->clip==1.0) (1.19.5)\n",
|
564 |
+
"Requirement already satisfied: pillow>=5.3.0 in /usr/local/lib/python3.7/dist-packages (from torchvision->clip==1.0) (7.1.2)\n",
|
565 |
+
"Building wheels for collected packages: clip\n",
|
566 |
+
" Building wheel for clip (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
|
567 |
+
" Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369080 sha256=fda43d2b80cfb2b33c2d43e23ea5f53293a9a8b48d5f9e341de527f6adfbf5a3\n",
|
568 |
+
" Stored in directory: /tmp/pip-ephem-wheel-cache-kmmplf44/wheels/fd/b9/c3/5b4470e35ed76e174bff77c92f91da82098d5e35fd5bc8cdac\n",
|
569 |
+
"Successfully built clip\n",
|
570 |
+
"Installing collected packages: clip\n",
|
571 |
+
"Successfully installed clip-1.0\n"
|
572 |
+
],
|
573 |
+
"name": "stdout"
|
574 |
+
}
|
575 |
+
]
|
576 |
+
},
|
577 |
+
{
|
578 |
+
"cell_type": "code",
|
579 |
+
"metadata": {
|
580 |
+
"id": "C1hkDT38hSaP",
|
581 |
+
"colab": {
|
582 |
+
"base_uri": "https://localhost:8080/"
|
583 |
+
},
|
584 |
+
"outputId": "e10d4f17-8fa6-4b75-a18f-f0c38990b5a3"
|
585 |
+
},
|
586 |
+
"source": [
|
587 |
+
"import numpy as np\n",
|
588 |
+
"import torch\n",
|
589 |
+
"import clip\n",
|
590 |
+
"from tqdm.notebook import tqdm\n",
|
591 |
+
"from pkg_resources import packaging\n",
|
592 |
+
"\n",
|
593 |
+
"print(\"Torch version:\", torch.__version__)\n"
|
594 |
+
],
|
595 |
+
"execution_count": 2,
|
596 |
+
"outputs": [
|
597 |
+
{
|
598 |
+
"output_type": "stream",
|
599 |
+
"text": [
|
600 |
+
"Torch version: 1.9.0+cu102\n"
|
601 |
+
],
|
602 |
+
"name": "stdout"
|
603 |
+
}
|
604 |
+
]
|
605 |
+
},
|
606 |
+
{
|
607 |
+
"cell_type": "markdown",
|
608 |
+
"metadata": {
|
609 |
+
"id": "eFxgLV5HAEEw"
|
610 |
+
},
|
611 |
+
"source": [
|
612 |
+
"# Loading the model\n",
|
613 |
+
"\n",
|
614 |
+
"Download and instantiate a CLIP model using the `clip` module that we just installed."
|
615 |
+
]
|
616 |
+
},
|
617 |
+
{
|
618 |
+
"cell_type": "code",
|
619 |
+
"metadata": {
|
620 |
+
"id": "uLFS29hnhlY4",
|
621 |
+
"colab": {
|
622 |
+
"base_uri": "https://localhost:8080/"
|
623 |
+
},
|
624 |
+
"outputId": "09abb234-693e-4efb-953f-e1847ba95758"
|
625 |
+
},
|
626 |
+
"source": [
|
627 |
+
"clip.available_models()"
|
628 |
+
],
|
629 |
+
"execution_count": 3,
|
630 |
+
"outputs": [
|
631 |
+
{
|
632 |
+
"output_type": "execute_result",
|
633 |
+
"data": {
|
634 |
+
"text/plain": [
|
635 |
+
"['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16']"
|
636 |
+
]
|
637 |
+
},
|
638 |
+
"metadata": {
|
639 |
+
"tags": []
|
640 |
+
},
|
641 |
+
"execution_count": 3
|
642 |
+
}
|
643 |
+
]
|
644 |
+
},
|
645 |
+
{
|
646 |
+
"cell_type": "code",
|
647 |
+
"metadata": {
|
648 |
+
"id": "cboKZocQlSYX",
|
649 |
+
"colab": {
|
650 |
+
"base_uri": "https://localhost:8080/"
|
651 |
+
},
|
652 |
+
"outputId": "240acdd0-ca62-45db-8418-9e4ef73e8aff"
|
653 |
+
},
|
654 |
+
"source": [
|
655 |
+
"model, preprocess = clip.load(\"ViT-B/32\")"
|
656 |
+
],
|
657 |
+
"execution_count": 4,
|
658 |
+
"outputs": [
|
659 |
+
{
|
660 |
+
"output_type": "stream",
|
661 |
+
"text": [
|
662 |
+
"100%|███████████████████████████████████████| 338M/338M [00:05<00:00, 63.6MiB/s]\n"
|
663 |
+
],
|
664 |
+
"name": "stderr"
|
665 |
+
}
|
666 |
+
]
|
667 |
+
},
|
668 |
+
{
|
669 |
+
"cell_type": "code",
|
670 |
+
"metadata": {
|
671 |
+
"colab": {
|
672 |
+
"base_uri": "https://localhost:8080/"
|
673 |
+
},
|
674 |
+
"id": "IBRVTY9lbGm8",
|
675 |
+
"outputId": "785019a1-1f40-45b0-e349-b0d4ec3173bf"
|
676 |
+
},
|
677 |
+
"source": [
|
678 |
+
"input_resolution = model.visual.input_resolution\n",
|
679 |
+
"context_length = model.context_length\n",
|
680 |
+
"vocab_size = model.vocab_size\n",
|
681 |
+
"\n",
|
682 |
+
"print(\"Model parameters:\", f\"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}\")\n",
|
683 |
+
"print(\"Input resolution:\", input_resolution)\n",
|
684 |
+
"print(\"Context length:\", context_length)\n",
|
685 |
+
"print(\"Vocab size:\", vocab_size)"
|
686 |
+
],
|
687 |
+
"execution_count": 5,
|
688 |
+
"outputs": [
|
689 |
+
{
|
690 |
+
"output_type": "stream",
|
691 |
+
"text": [
|
692 |
+
"Model parameters: 151,277,313\n",
|
693 |
+
"Input resolution: 224\n",
|
694 |
+
"Context length: 77\n",
|
695 |
+
"Vocab size: 49408\n"
|
696 |
+
],
|
697 |
+
"name": "stdout"
|
698 |
+
}
|
699 |
+
]
|
700 |
+
},
|
701 |
+
{
|
702 |
+
"cell_type": "markdown",
|
703 |
+
"metadata": {
|
704 |
+
"id": "LhO3OtOmF8M4"
|
705 |
+
},
|
706 |
+
"source": [
|
707 |
+
"# Preparing ImageNet labels and prompts\n",
|
708 |
+
"\n",
|
709 |
+
"The following cell contains the 1,000 labels for the ImageNet dataset, followed by the text templates we'll use as \"prompt engineering\"."
|
710 |
+
]
|
711 |
+
},
|
712 |
+
{
|
713 |
+
"cell_type": "code",
|
714 |
+
"metadata": {
|
715 |
+
"id": "R2HbOZrqa0jF"
|
716 |
+
},
|
717 |
+
"source": [
|
718 |
+
"imagenet_classes = [\"tench\", \"goldfish\", \"great white shark\", \"tiger shark\", \"hammerhead shark\", \"electric ray\", \"stingray\", \"rooster\", \"hen\", \"ostrich\", \"brambling\", \"goldfinch\", \"house finch\", \"junco\", \"indigo bunting\", \"American robin\", \"bulbul\", \"jay\", \"magpie\", \"chickadee\", \"American dipper\", \"kite (bird of prey)\", \"bald eagle\", \"vulture\", \"great grey owl\", \"fire salamander\", \"smooth newt\", \"newt\", \"spotted salamander\", \"axolotl\", \"American bullfrog\", \"tree frog\", \"tailed frog\", \"loggerhead sea turtle\", \"leatherback sea turtle\", \"mud turtle\", \"terrapin\", \"box turtle\", \"banded gecko\", \"green iguana\", \"Carolina anole\", \"desert grassland whiptail lizard\", \"agama\", \"frilled-necked lizard\", \"alligator lizard\", \"Gila monster\", \"European green lizard\", \"chameleon\", \"Komodo dragon\", \"Nile crocodile\", \"American alligator\", \"triceratops\", \"worm snake\", \"ring-necked snake\", \"eastern hog-nosed snake\", \"smooth green snake\", \"kingsnake\", \"garter snake\", \"water snake\", \"vine snake\", \"night snake\", \"boa constrictor\", \"African rock python\", \"Indian cobra\", \"green mamba\", \"sea snake\", \"Saharan horned viper\", \"eastern diamondback rattlesnake\", \"sidewinder rattlesnake\", \"trilobite\", \"harvestman\", \"scorpion\", \"yellow garden spider\", \"barn spider\", \"European garden spider\", \"southern black widow\", \"tarantula\", \"wolf spider\", \"tick\", \"centipede\", \"black grouse\", \"ptarmigan\", \"ruffed grouse\", \"prairie grouse\", \"peafowl\", \"quail\", \"partridge\", \"african grey parrot\", \"macaw\", \"sulphur-crested cockatoo\", \"lorikeet\", \"coucal\", \"bee eater\", \"hornbill\", \"hummingbird\", \"jacamar\", \"toucan\", \"duck\", \"red-breasted merganser\", \"goose\", \"black swan\", \"tusker\", \"echidna\", \"platypus\", \"wallaby\", \"koala\", \"wombat\", \"jellyfish\", \"sea anemone\", \"brain coral\", \"flatworm\", \"nematode\", \"conch\", \"snail\", \"slug\", \"sea slug\", \"chiton\", \"chambered nautilus\", \"Dungeness crab\", \"rock crab\", \"fiddler crab\", \"red king crab\", \"American lobster\", \"spiny lobster\", \"crayfish\", \"hermit crab\", \"isopod\", \"white stork\", \"black stork\", \"spoonbill\", \"flamingo\", \"little blue heron\", \"great egret\", \"bittern bird\", \"crane bird\", \"limpkin\", \"common gallinule\", \"American coot\", \"bustard\", \"ruddy turnstone\", \"dunlin\", \"common redshank\", \"dowitcher\", \"oystercatcher\", \"pelican\", \"king penguin\", \"albatross\", \"grey whale\", \"killer whale\", \"dugong\", \"sea lion\", \"Chihuahua\", \"Japanese Chin\", \"Maltese\", \"Pekingese\", \"Shih Tzu\", \"King Charles Spaniel\", \"Papillon\", \"toy terrier\", \"Rhodesian Ridgeback\", \"Afghan Hound\", \"Basset Hound\", \"Beagle\", \"Bloodhound\", \"Bluetick Coonhound\", \"Black and Tan Coonhound\", \"Treeing Walker Coonhound\", \"English foxhound\", \"Redbone Coonhound\", \"borzoi\", \"Irish Wolfhound\", \"Italian Greyhound\", \"Whippet\", \"Ibizan Hound\", \"Norwegian Elkhound\", \"Otterhound\", \"Saluki\", \"Scottish Deerhound\", \"Weimaraner\", \"Staffordshire Bull Terrier\", \"American Staffordshire Terrier\", \"Bedlington Terrier\", \"Border Terrier\", \"Kerry Blue Terrier\", \"Irish Terrier\", \"Norfolk Terrier\", \"Norwich Terrier\", \"Yorkshire Terrier\", \"Wire Fox Terrier\", \"Lakeland Terrier\", \"Sealyham Terrier\", \"Airedale Terrier\", \"Cairn Terrier\", \"Australian Terrier\", \"Dandie Dinmont Terrier\", 
\"Boston Terrier\", \"Miniature Schnauzer\", \"Giant Schnauzer\", \"Standard Schnauzer\", \"Scottish Terrier\", \"Tibetan Terrier\", \"Australian Silky Terrier\", \"Soft-coated Wheaten Terrier\", \"West Highland White Terrier\", \"Lhasa Apso\", \"Flat-Coated Retriever\", \"Curly-coated Retriever\", \"Golden Retriever\", \"Labrador Retriever\", \"Chesapeake Bay Retriever\", \"German Shorthaired Pointer\", \"Vizsla\", \"English Setter\", \"Irish Setter\", \"Gordon Setter\", \"Brittany dog\", \"Clumber Spaniel\", \"English Springer Spaniel\", \"Welsh Springer Spaniel\", \"Cocker Spaniel\", \"Sussex Spaniel\", \"Irish Water Spaniel\", \"Kuvasz\", \"Schipperke\", \"Groenendael dog\", \"Malinois\", \"Briard\", \"Australian Kelpie\", \"Komondor\", \"Old English Sheepdog\", \"Shetland Sheepdog\", \"collie\", \"Border Collie\", \"Bouvier des Flandres dog\", \"Rottweiler\", \"German Shepherd Dog\", \"Dobermann\", \"Miniature Pinscher\", \"Greater Swiss Mountain Dog\", \"Bernese Mountain Dog\", \"Appenzeller Sennenhund\", \"Entlebucher Sennenhund\", \"Boxer\", \"Bullmastiff\", \"Tibetan Mastiff\", \"French Bulldog\", \"Great Dane\", \"St. Bernard\", \"husky\", \"Alaskan Malamute\", \"Siberian Husky\", \"Dalmatian\", \"Affenpinscher\", \"Basenji\", \"pug\", \"Leonberger\", \"Newfoundland dog\", \"Great Pyrenees dog\", \"Samoyed\", \"Pomeranian\", \"Chow Chow\", \"Keeshond\", \"brussels griffon\", \"Pembroke Welsh Corgi\", \"Cardigan Welsh Corgi\", \"Toy Poodle\", \"Miniature Poodle\", \"Standard Poodle\", \"Mexican hairless dog (xoloitzcuintli)\", \"grey wolf\", \"Alaskan tundra wolf\", \"red wolf or maned wolf\", \"coyote\", \"dingo\", \"dhole\", \"African wild dog\", \"hyena\", \"red fox\", \"kit fox\", \"Arctic fox\", \"grey fox\", \"tabby cat\", \"tiger cat\", \"Persian cat\", \"Siamese cat\", \"Egyptian Mau\", \"cougar\", \"lynx\", \"leopard\", \"snow leopard\", \"jaguar\", \"lion\", \"tiger\", \"cheetah\", \"brown bear\", \"American black bear\", \"polar bear\", \"sloth bear\", \"mongoose\", \"meerkat\", \"tiger beetle\", \"ladybug\", \"ground beetle\", \"longhorn beetle\", \"leaf beetle\", \"dung beetle\", \"rhinoceros beetle\", \"weevil\", \"fly\", \"bee\", \"ant\", \"grasshopper\", \"cricket insect\", \"stick insect\", \"cockroach\", \"praying mantis\", \"cicada\", \"leafhopper\", \"lacewing\", \"dragonfly\", \"damselfly\", \"red admiral butterfly\", \"ringlet butterfly\", \"monarch butterfly\", \"small white butterfly\", \"sulphur butterfly\", \"gossamer-winged butterfly\", \"starfish\", \"sea urchin\", \"sea cucumber\", \"cottontail rabbit\", \"hare\", \"Angora rabbit\", \"hamster\", \"porcupine\", \"fox squirrel\", \"marmot\", \"beaver\", \"guinea pig\", \"common sorrel horse\", \"zebra\", \"pig\", \"wild boar\", \"warthog\", \"hippopotamus\", \"ox\", \"water buffalo\", \"bison\", \"ram (adult male sheep)\", \"bighorn sheep\", \"Alpine ibex\", \"hartebeest\", \"impala (antelope)\", \"gazelle\", \"arabian camel\", \"llama\", \"weasel\", \"mink\", \"European polecat\", \"black-footed ferret\", \"otter\", \"skunk\", \"badger\", \"armadillo\", \"three-toed sloth\", \"orangutan\", \"gorilla\", \"chimpanzee\", \"gibbon\", \"siamang\", \"guenon\", \"patas monkey\", \"baboon\", \"macaque\", \"langur\", \"black-and-white colobus\", \"proboscis monkey\", \"marmoset\", \"white-headed capuchin\", \"howler monkey\", \"titi monkey\", \"Geoffroy's spider monkey\", \"common squirrel monkey\", \"ring-tailed lemur\", \"indri\", \"Asian elephant\", \"African bush elephant\", \"red panda\", \"giant panda\", 
\"snoek fish\", \"eel\", \"silver salmon\", \"rock beauty fish\", \"clownfish\", \"sturgeon\", \"gar fish\", \"lionfish\", \"pufferfish\", \"abacus\", \"abaya\", \"academic gown\", \"accordion\", \"acoustic guitar\", \"aircraft carrier\", \"airliner\", \"airship\", \"altar\", \"ambulance\", \"amphibious vehicle\", \"analog clock\", \"apiary\", \"apron\", \"trash can\", \"assault rifle\", \"backpack\", \"bakery\", \"balance beam\", \"balloon\", \"ballpoint pen\", \"Band-Aid\", \"banjo\", \"baluster / handrail\", \"barbell\", \"barber chair\", \"barbershop\", \"barn\", \"barometer\", \"barrel\", \"wheelbarrow\", \"baseball\", \"basketball\", \"bassinet\", \"bassoon\", \"swimming cap\", \"bath towel\", \"bathtub\", \"station wagon\", \"lighthouse\", \"beaker\", \"military hat (bearskin or shako)\", \"beer bottle\", \"beer glass\", \"bell tower\", \"baby bib\", \"tandem bicycle\", \"bikini\", \"ring binder\", \"binoculars\", \"birdhouse\", \"boathouse\", \"bobsleigh\", \"bolo tie\", \"poke bonnet\", \"bookcase\", \"bookstore\", \"bottle cap\", \"hunting bow\", \"bow tie\", \"brass memorial plaque\", \"bra\", \"breakwater\", \"breastplate\", \"broom\", \"bucket\", \"buckle\", \"bulletproof vest\", \"high-speed train\", \"butcher shop\", \"taxicab\", \"cauldron\", \"candle\", \"cannon\", \"canoe\", \"can opener\", \"cardigan\", \"car mirror\", \"carousel\", \"tool kit\", \"cardboard box / carton\", \"car wheel\", \"automated teller machine\", \"cassette\", \"cassette player\", \"castle\", \"catamaran\", \"CD player\", \"cello\", \"mobile phone\", \"chain\", \"chain-link fence\", \"chain mail\", \"chainsaw\", \"storage chest\", \"chiffonier\", \"bell or wind chime\", \"china cabinet\", \"Christmas stocking\", \"church\", \"movie theater\", \"cleaver\", \"cliff dwelling\", \"cloak\", \"clogs\", \"cocktail shaker\", \"coffee mug\", \"coffeemaker\", \"spiral or coil\", \"combination lock\", \"computer keyboard\", \"candy store\", \"container ship\", \"convertible\", \"corkscrew\", \"cornet\", \"cowboy boot\", \"cowboy hat\", \"cradle\", \"construction crane\", \"crash helmet\", \"crate\", \"infant bed\", \"Crock Pot\", \"croquet ball\", \"crutch\", \"cuirass\", \"dam\", \"desk\", \"desktop computer\", \"rotary dial telephone\", \"diaper\", \"digital clock\", \"digital watch\", \"dining table\", \"dishcloth\", \"dishwasher\", \"disc brake\", \"dock\", \"dog sled\", \"dome\", \"doormat\", \"drilling rig\", \"drum\", \"drumstick\", \"dumbbell\", \"Dutch oven\", \"electric fan\", \"electric guitar\", \"electric locomotive\", \"entertainment center\", \"envelope\", \"espresso machine\", \"face powder\", \"feather boa\", \"filing cabinet\", \"fireboat\", \"fire truck\", \"fire screen\", \"flagpole\", \"flute\", \"folding chair\", \"football helmet\", \"forklift\", \"fountain\", \"fountain pen\", \"four-poster bed\", \"freight car\", \"French horn\", \"frying pan\", \"fur coat\", \"garbage truck\", \"gas mask or respirator\", \"gas pump\", \"goblet\", \"go-kart\", \"golf ball\", \"golf cart\", \"gondola\", \"gong\", \"gown\", \"grand piano\", \"greenhouse\", \"radiator grille\", \"grocery store\", \"guillotine\", \"hair clip\", \"hair spray\", \"half-track\", \"hammer\", \"hamper\", \"hair dryer\", \"hand-held computer\", \"handkerchief\", \"hard disk drive\", \"harmonica\", \"harp\", \"combine harvester\", \"hatchet\", \"holster\", \"home theater\", \"honeycomb\", \"hook\", \"hoop skirt\", \"gymnastic horizontal bar\", \"horse-drawn vehicle\", \"hourglass\", \"iPod\", \"clothes iron\", \"carved pumpkin\", 
\"jeans\", \"jeep\", \"T-shirt\", \"jigsaw puzzle\", \"rickshaw\", \"joystick\", \"kimono\", \"knee pad\", \"knot\", \"lab coat\", \"ladle\", \"lampshade\", \"laptop computer\", \"lawn mower\", \"lens cap\", \"letter opener\", \"library\", \"lifeboat\", \"lighter\", \"limousine\", \"ocean liner\", \"lipstick\", \"slip-on shoe\", \"lotion\", \"music speaker\", \"loupe magnifying glass\", \"sawmill\", \"magnetic compass\", \"messenger bag\", \"mailbox\", \"tights\", \"one-piece bathing suit\", \"manhole cover\", \"maraca\", \"marimba\", \"mask\", \"matchstick\", \"maypole\", \"maze\", \"measuring cup\", \"medicine cabinet\", \"megalith\", \"microphone\", \"microwave oven\", \"military uniform\", \"milk can\", \"minibus\", \"miniskirt\", \"minivan\", \"missile\", \"mitten\", \"mixing bowl\", \"mobile home\", \"ford model t\", \"modem\", \"monastery\", \"monitor\", \"moped\", \"mortar and pestle\", \"graduation cap\", \"mosque\", \"mosquito net\", \"vespa\", \"mountain bike\", \"tent\", \"computer mouse\", \"mousetrap\", \"moving van\", \"muzzle\", \"metal nail\", \"neck brace\", \"necklace\", \"baby pacifier\", \"notebook computer\", \"obelisk\", \"oboe\", \"ocarina\", \"odometer\", \"oil filter\", \"pipe organ\", \"oscilloscope\", \"overskirt\", \"bullock cart\", \"oxygen mask\", \"product packet / packaging\", \"paddle\", \"paddle wheel\", \"padlock\", \"paintbrush\", \"pajamas\", \"palace\", \"pan flute\", \"paper towel\", \"parachute\", \"parallel bars\", \"park bench\", \"parking meter\", \"railroad car\", \"patio\", \"payphone\", \"pedestal\", \"pencil case\", \"pencil sharpener\", \"perfume\", \"Petri dish\", \"photocopier\", \"plectrum\", \"Pickelhaube\", \"picket fence\", \"pickup truck\", \"pier\", \"piggy bank\", \"pill bottle\", \"pillow\", \"ping-pong ball\", \"pinwheel\", \"pirate ship\", \"drink pitcher\", \"block plane\", \"planetarium\", \"plastic bag\", \"plate rack\", \"farm plow\", \"plunger\", \"Polaroid camera\", \"pole\", \"police van\", \"poncho\", \"pool table\", \"soda bottle\", \"plant pot\", \"potter's wheel\", \"power drill\", \"prayer rug\", \"printer\", \"prison\", \"missile\", \"projector\", \"hockey puck\", \"punching bag\", \"purse\", \"quill\", \"quilt\", \"race car\", \"racket\", \"radiator\", \"radio\", \"radio telescope\", \"rain barrel\", \"recreational vehicle\", \"fishing casting reel\", \"reflex camera\", \"refrigerator\", \"remote control\", \"restaurant\", \"revolver\", \"rifle\", \"rocking chair\", \"rotisserie\", \"eraser\", \"rugby ball\", \"ruler measuring stick\", \"sneaker\", \"safe\", \"safety pin\", \"salt shaker\", \"sandal\", \"sarong\", \"saxophone\", \"scabbard\", \"weighing scale\", \"school bus\", \"schooner\", \"scoreboard\", \"CRT monitor\", \"screw\", \"screwdriver\", \"seat belt\", \"sewing machine\", \"shield\", \"shoe store\", \"shoji screen / room divider\", \"shopping basket\", \"shopping cart\", \"shovel\", \"shower cap\", \"shower curtain\", \"ski\", \"balaclava ski mask\", \"sleeping bag\", \"slide rule\", \"sliding door\", \"slot machine\", \"snorkel\", \"snowmobile\", \"snowplow\", \"soap dispenser\", \"soccer ball\", \"sock\", \"solar thermal collector\", \"sombrero\", \"soup bowl\", \"keyboard space bar\", \"space heater\", \"space shuttle\", \"spatula\", \"motorboat\", \"spider web\", \"spindle\", \"sports car\", \"spotlight\", \"stage\", \"steam locomotive\", \"through arch bridge\", \"steel drum\", \"stethoscope\", \"scarf\", \"stone wall\", \"stopwatch\", \"stove\", \"strainer\", \"tram\", \"stretcher\", \"couch\", 
\"stupa\", \"submarine\", \"suit\", \"sundial\", \"sunglasses\", \"sunglasses\", \"sunscreen\", \"suspension bridge\", \"mop\", \"sweatshirt\", \"swim trunks / shorts\", \"swing\", \"electrical switch\", \"syringe\", \"table lamp\", \"tank\", \"tape player\", \"teapot\", \"teddy bear\", \"television\", \"tennis ball\", \"thatched roof\", \"front curtain\", \"thimble\", \"threshing machine\", \"throne\", \"tile roof\", \"toaster\", \"tobacco shop\", \"toilet seat\", \"torch\", \"totem pole\", \"tow truck\", \"toy store\", \"tractor\", \"semi-trailer truck\", \"tray\", \"trench coat\", \"tricycle\", \"trimaran\", \"tripod\", \"triumphal arch\", \"trolleybus\", \"trombone\", \"hot tub\", \"turnstile\", \"typewriter keyboard\", \"umbrella\", \"unicycle\", \"upright piano\", \"vacuum cleaner\", \"vase\", \"vaulted or arched ceiling\", \"velvet fabric\", \"vending machine\", \"vestment\", \"viaduct\", \"violin\", \"volleyball\", \"waffle iron\", \"wall clock\", \"wallet\", \"wardrobe\", \"military aircraft\", \"sink\", \"washing machine\", \"water bottle\", \"water jug\", \"water tower\", \"whiskey jug\", \"whistle\", \"hair wig\", \"window screen\", \"window shade\", \"Windsor tie\", \"wine bottle\", \"airplane wing\", \"wok\", \"wooden spoon\", \"wool\", \"split-rail fence\", \"shipwreck\", \"sailboat\", \"yurt\", \"website\", \"comic book\", \"crossword\", \"traffic or street sign\", \"traffic light\", \"dust jacket\", \"menu\", \"plate\", \"guacamole\", \"consomme\", \"hot pot\", \"trifle\", \"ice cream\", \"popsicle\", \"baguette\", \"bagel\", \"pretzel\", \"cheeseburger\", \"hot dog\", \"mashed potatoes\", \"cabbage\", \"broccoli\", \"cauliflower\", \"zucchini\", \"spaghetti squash\", \"acorn squash\", \"butternut squash\", \"cucumber\", \"artichoke\", \"bell pepper\", \"cardoon\", \"mushroom\", \"Granny Smith apple\", \"strawberry\", \"orange\", \"lemon\", \"fig\", \"pineapple\", \"banana\", \"jackfruit\", \"cherimoya (custard apple)\", \"pomegranate\", \"hay\", \"carbonara\", \"chocolate syrup\", \"dough\", \"meatloaf\", \"pizza\", \"pot pie\", \"burrito\", \"red wine\", \"espresso\", \"tea cup\", \"eggnog\", \"mountain\", \"bubble\", \"cliff\", \"coral reef\", \"geyser\", \"lakeshore\", \"promontory\", \"sandbar\", \"beach\", \"valley\", \"volcano\", \"baseball player\", \"bridegroom\", \"scuba diver\", \"rapeseed\", \"daisy\", \"yellow lady's slipper\", \"corn\", \"acorn\", \"rose hip\", \"horse chestnut seed\", \"coral fungus\", \"agaric\", \"gyromitra\", \"stinkhorn mushroom\", \"earth star fungus\", \"hen of the woods mushroom\", \"bolete\", \"corn cob\", \"toilet paper\"]"
|
719 |
+
],
|
720 |
+
"execution_count": 6,
|
721 |
+
"outputs": []
|
722 |
+
},
|
723 |
+
{
|
724 |
+
"cell_type": "markdown",
|
725 |
+
"metadata": {
|
726 |
+
"id": "eMQSCuBta2G6"
|
727 |
+
},
|
728 |
+
"source": [
|
729 |
+
"A subset of these class names are modified from the default ImageNet class names sourced from Anish Athalye's imagenet-simple-labels.\n",
|
730 |
+
"\n",
|
731 |
+
"These edits were made via trial and error and concentrated on the lowest performing classes according to top_1 and top_5 accuracy on the ImageNet training set for the RN50, RN101, and RN50x4 models. These tweaks improve top_1 by 1.5% on ViT-B/32 over using the default class names. Alec got bored somewhere along the way as gains started to diminish and never finished updating / tweaking the list. He also didn't revisit this with the better performing RN50x16, RN50x64, or any of the ViT models. He thinks it's likely another 0.5% to 1% top_1 could be gained from further work here. It'd be interesting to more rigorously study / understand this.\n",
|
732 |
+
"\n",
|
733 |
+
"Some examples beyond the crane/crane -> construction crane / bird crane issue mentioned in Section 3.1.4 of the paper include:\n",
|
734 |
+
"\n",
|
735 |
+
"- CLIP interprets \"nail\" as \"fingernail\" so we changed the label to \"metal nail\".\n",
|
736 |
+
"- ImageNet kite class refers to the bird of prey, not the flying toy, so we changed \"kite\" to \"kite (bird of prey)\"\n",
|
737 |
+
"- The ImageNet class for red wolf seems to include a lot of mislabeled maned wolfs so we changed \"red wolf\" to \"red wolf or maned wolf\""
|
738 |
+
]
|
739 |
+
},
|
740 |
+
{
|
741 |
+
"cell_type": "code",
|
742 |
+
"metadata": {
|
743 |
+
"id": "toGtcd-Ji_MD",
|
744 |
+
"colab": {
|
745 |
+
"base_uri": "https://localhost:8080/"
|
746 |
+
},
|
747 |
+
"outputId": "b6eb0753-2bee-4144-abe3-fbd23f35f555"
|
748 |
+
},
|
749 |
+
"source": [
|
750 |
+
"imagenet_templates = [\n",
|
751 |
+
" 'a bad photo of a {}.',\n",
|
752 |
+
" 'a photo of many {}.',\n",
|
753 |
+
" 'a sculpture of a {}.',\n",
|
754 |
+
" 'a photo of the hard to see {}.',\n",
|
755 |
+
" 'a low resolution photo of the {}.',\n",
|
756 |
+
" 'a rendering of a {}.',\n",
|
757 |
+
" 'graffiti of a {}.',\n",
|
758 |
+
" 'a bad photo of the {}.',\n",
|
759 |
+
" 'a cropped photo of the {}.',\n",
|
760 |
+
" 'a tattoo of a {}.',\n",
|
761 |
+
" 'the embroidered {}.',\n",
|
762 |
+
" 'a photo of a hard to see {}.',\n",
|
763 |
+
" 'a bright photo of a {}.',\n",
|
764 |
+
" 'a photo of a clean {}.',\n",
|
765 |
+
" 'a photo of a dirty {}.',\n",
|
766 |
+
" 'a dark photo of the {}.',\n",
|
767 |
+
" 'a drawing of a {}.',\n",
|
768 |
+
" 'a photo of my {}.',\n",
|
769 |
+
" 'the plastic {}.',\n",
|
770 |
+
" 'a photo of the cool {}.',\n",
|
771 |
+
" 'a close-up photo of a {}.',\n",
|
772 |
+
" 'a black and white photo of the {}.',\n",
|
773 |
+
" 'a painting of the {}.',\n",
|
774 |
+
" 'a painting of a {}.',\n",
|
775 |
+
" 'a pixelated photo of the {}.',\n",
|
776 |
+
" 'a sculpture of the {}.',\n",
|
777 |
+
" 'a bright photo of the {}.',\n",
|
778 |
+
" 'a cropped photo of a {}.',\n",
|
779 |
+
" 'a plastic {}.',\n",
|
780 |
+
" 'a photo of the dirty {}.',\n",
|
781 |
+
" 'a jpeg corrupted photo of a {}.',\n",
|
782 |
+
" 'a blurry photo of the {}.',\n",
|
783 |
+
" 'a photo of the {}.',\n",
|
784 |
+
" 'a good photo of the {}.',\n",
|
785 |
+
" 'a rendering of the {}.',\n",
|
786 |
+
" 'a {} in a video game.',\n",
|
787 |
+
" 'a photo of one {}.',\n",
|
788 |
+
" 'a doodle of a {}.',\n",
|
789 |
+
" 'a close-up photo of the {}.',\n",
|
790 |
+
" 'a photo of a {}.',\n",
|
791 |
+
" 'the origami {}.',\n",
|
792 |
+
" 'the {} in a video game.',\n",
|
793 |
+
" 'a sketch of a {}.',\n",
|
794 |
+
" 'a doodle of the {}.',\n",
|
795 |
+
" 'a origami {}.',\n",
|
796 |
+
" 'a low resolution photo of a {}.',\n",
|
797 |
+
" 'the toy {}.',\n",
|
798 |
+
" 'a rendition of the {}.',\n",
|
799 |
+
" 'a photo of the clean {}.',\n",
|
800 |
+
" 'a photo of a large {}.',\n",
|
801 |
+
" 'a rendition of a {}.',\n",
|
802 |
+
" 'a photo of a nice {}.',\n",
|
803 |
+
" 'a photo of a weird {}.',\n",
|
804 |
+
" 'a blurry photo of a {}.',\n",
|
805 |
+
" 'a cartoon {}.',\n",
|
806 |
+
" 'art of a {}.',\n",
|
807 |
+
" 'a sketch of the {}.',\n",
|
808 |
+
" 'a embroidered {}.',\n",
|
809 |
+
" 'a pixelated photo of a {}.',\n",
|
810 |
+
" 'itap of the {}.',\n",
|
811 |
+
" 'a jpeg corrupted photo of the {}.',\n",
|
812 |
+
" 'a good photo of a {}.',\n",
|
813 |
+
" 'a plushie {}.',\n",
|
814 |
+
" 'a photo of the nice {}.',\n",
|
815 |
+
" 'a photo of the small {}.',\n",
|
816 |
+
" 'a photo of the weird {}.',\n",
|
817 |
+
" 'the cartoon {}.',\n",
|
818 |
+
" 'art of the {}.',\n",
|
819 |
+
" 'a drawing of the {}.',\n",
|
820 |
+
" 'a photo of the large {}.',\n",
|
821 |
+
" 'a black and white photo of a {}.',\n",
|
822 |
+
" 'the plushie {}.',\n",
|
823 |
+
" 'a dark photo of a {}.',\n",
|
824 |
+
" 'itap of a {}.',\n",
|
825 |
+
" 'graffiti of the {}.',\n",
|
826 |
+
" 'a toy {}.',\n",
|
827 |
+
" 'itap of my {}.',\n",
|
828 |
+
" 'a photo of a cool {}.',\n",
|
829 |
+
" 'a photo of a small {}.',\n",
|
830 |
+
" 'a tattoo of the {}.',\n",
|
831 |
+
"]\n",
|
832 |
+
"\n",
|
833 |
+
"print(f\"{len(imagenet_classes)} classes, {len(imagenet_templates)} templates\")"
|
834 |
+
],
|
835 |
+
"execution_count": 7,
|
836 |
+
"outputs": [
|
837 |
+
{
|
838 |
+
"output_type": "stream",
|
839 |
+
"text": [
|
840 |
+
"1000 classes, 80 templates\n"
|
841 |
+
],
|
842 |
+
"name": "stdout"
|
843 |
+
}
|
844 |
+
]
|
845 |
+
},
|
846 |
+
{
|
847 |
+
"cell_type": "markdown",
|
848 |
+
"metadata": {
|
849 |
+
"id": "aRB5OzgpHwqQ"
|
850 |
+
},
|
851 |
+
"source": [
|
852 |
+
"A similar, intuition-guided trial and error based on the ImageNet training set was used for templates. This list is pretty haphazard and was gradually made / expanded over the course of about a year of the project and was revisited / tweaked every few months. A surprising / weird thing was adding templates intended to help ImageNet-R performance (specifying different possible renditions of an object) improved standard ImageNet accuracy too.\n",
|
853 |
+
"\n",
|
854 |
+
"After the 80 templates were \"locked\" for the paper, we ran sequential forward selection over the list of 80 templates. The search terminated after ensembling 7 templates and selected them in the order below.\n",
|
855 |
+
"\n",
|
856 |
+
"1. itap of a {}.\n",
|
857 |
+
"2. a bad photo of the {}.\n",
|
858 |
+
"3. a origami {}.\n",
|
859 |
+
"4. a photo of the large {}.\n",
|
860 |
+
"5. a {} in a video game.\n",
|
861 |
+
"6. art of the {}.\n",
|
862 |
+
"7. a photo of the small {}.\n",
|
863 |
+
"\n",
|
864 |
+
"Speculating, we think it's interesting to see different scales (large and small), a difficult view (a bad photo), and \"abstract\" versions (origami, video game, art), were all selected for, but we haven't studied this in any detail. This subset performs a bit better than the full 80 ensemble reported in the paper, especially for the smaller models."
|
865 |
+
]
|
866 |
+
},
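For reference, a minimal sketch (not part of the original notebook) of how the 7-template subset above could be ensembled on its own, assuming the `imagenet_classes` list and the `zeroshot_classifier` helper defined in the cells below:

```python
# Illustration only: build zero-shot weights from the 7 forward-selected templates.
# `imagenet_classes` and `zeroshot_classifier` are assumed from the cells below.
selected_templates = [
    'itap of a {}.',
    'a bad photo of the {}.',
    'a origami {}.',
    'a photo of the large {}.',
    'a {} in a video game.',
    'art of the {}.',
    'a photo of the small {}.',
]

# zeroshot_weights_7 = zeroshot_classifier(imagenet_classes, selected_templates)
```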
|
867 |
+
{
|
868 |
+
"cell_type": "markdown",
|
869 |
+
"metadata": {
|
870 |
+
"id": "4W8ARJVqBJXs"
|
871 |
+
},
|
872 |
+
"source": [
|
873 |
+
"# Loading the Images\n",
|
874 |
+
"\n",
|
875 |
+
"The ILSVRC2012 datasets are no longer available for download publicly. We instead download the ImageNet-V2 dataset by [Recht et al.](https://arxiv.org/abs/1902.10811).\n",
|
876 |
+
"\n",
|
877 |
+
"If you have the ImageNet dataset downloaded, you can replace the dataset with the official torchvision loader, e.g.:\n",
|
878 |
+
"\n",
|
879 |
+
"```python\n",
|
880 |
+
"images = torchvision.datasets.ImageNet(\"path/to/imagenet\", split='val', transform=preprocess)\n",
|
881 |
+
"```"
|
882 |
+
]
|
883 |
+
},
|
884 |
+
{
|
885 |
+
"cell_type": "code",
|
886 |
+
"metadata": {
|
887 |
+
"colab": {
|
888 |
+
"base_uri": "https://localhost:8080/"
|
889 |
+
},
|
890 |
+
"id": "moHR4UlHKsDc",
|
891 |
+
"outputId": "40731297-edc7-4cd0-be75-ed426c8fb005"
|
892 |
+
},
|
893 |
+
"source": [
|
894 |
+
"! pip install git+https://github.com/modestyachts/ImageNetV2_pytorch\n",
|
895 |
+
"\n",
|
896 |
+
"from imagenetv2_pytorch import ImageNetV2Dataset\n",
|
897 |
+
"\n",
|
898 |
+
"images = ImageNetV2Dataset(transform=preprocess)\n",
|
899 |
+
"loader = torch.utils.data.DataLoader(images, batch_size=32, num_workers=2)"
|
900 |
+
],
|
901 |
+
"execution_count": 8,
|
902 |
+
"outputs": [
|
903 |
+
{
|
904 |
+
"output_type": "stream",
|
905 |
+
"text": [
|
906 |
+
"Collecting git+https://github.com/modestyachts/ImageNetV2_pytorch\n",
|
907 |
+
" Cloning https://github.com/modestyachts/ImageNetV2_pytorch to /tmp/pip-req-build-0kih0kn2\n",
|
908 |
+
" Running command git clone -q https://github.com/modestyachts/ImageNetV2_pytorch /tmp/pip-req-build-0kih0kn2\n",
|
909 |
+
"Building wheels for collected packages: imagenetv2-pytorch\n",
|
910 |
+
" Building wheel for imagenetv2-pytorch (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
|
911 |
+
" Created wheel for imagenetv2-pytorch: filename=imagenetv2_pytorch-0.1-py3-none-any.whl size=2663 sha256=ac31e0ed9c61afc5e0271eed315d3a82844a79ae64f8ce43fc1c98928cec129f\n",
|
912 |
+
" Stored in directory: /tmp/pip-ephem-wheel-cache-745b5n1m/wheels/ab/ee/f4/73bce5c7f68d28ce632ef33ae87ce60aaca021eb2b3b31a6fa\n",
|
913 |
+
"Successfully built imagenetv2-pytorch\n",
|
914 |
+
"Installing collected packages: imagenetv2-pytorch\n",
|
915 |
+
"Successfully installed imagenetv2-pytorch-0.1\n",
|
916 |
+
"Dataset matched-frequency not found on disk, downloading....\n"
|
917 |
+
],
|
918 |
+
"name": "stdout"
|
919 |
+
},
|
920 |
+
{
|
921 |
+
"output_type": "stream",
|
922 |
+
"text": [
|
923 |
+
"100%|██████████| 1.26G/1.26G [01:02<00:00, 20.2MiB/s]\n"
|
924 |
+
],
|
925 |
+
"name": "stderr"
|
926 |
+
},
|
927 |
+
{
|
928 |
+
"output_type": "stream",
|
929 |
+
"text": [
|
930 |
+
"Extracting....\n"
|
931 |
+
],
|
932 |
+
"name": "stdout"
|
933 |
+
}
|
934 |
+
]
|
935 |
+
},
|
936 |
+
{
|
937 |
+
"cell_type": "markdown",
|
938 |
+
"metadata": {
|
939 |
+
"id": "fz6D-F-Wbrtp"
|
940 |
+
},
|
941 |
+
"source": [
|
942 |
+
"# Creating zero-shot classifier weights"
|
943 |
+
]
|
944 |
+
},
|
945 |
+
{
|
946 |
+
"cell_type": "code",
|
947 |
+
"metadata": {
|
948 |
+
"colab": {
|
949 |
+
"base_uri": "https://localhost:8080/",
|
950 |
+
"height": 67,
|
951 |
+
"referenced_widgets": [
|
952 |
+
"66a1639713ae441d8a9b873381f9d774",
|
953 |
+
"610b775178c645e2b4663b77cc0c67b6",
|
954 |
+
"412dd15f0d8542f5ab2730f8616fb582",
|
955 |
+
"5e6315f36b4e4eeea5c6294b024e0c97",
|
956 |
+
"085d5388abda4202bfa66d0c088452f8",
|
957 |
+
"f75124b64aa147c693c67a78f8e3a231",
|
958 |
+
"6e5676a054874243b55fc6d120a07d01",
|
959 |
+
"dc6d1416c01a4047935ee15c3fd2eb1c"
|
960 |
+
]
|
961 |
+
},
|
962 |
+
"id": "sRqDoz1Gbsii",
|
963 |
+
"outputId": "312b8ebf-3961-4903-d8cb-3b7a94cc97b6"
|
964 |
+
},
|
965 |
+
"source": [
|
966 |
+
"def zeroshot_classifier(classnames, templates):\n",
|
967 |
+
" with torch.no_grad():\n",
|
968 |
+
" zeroshot_weights = []\n",
|
969 |
+
" for classname in tqdm(classnames):\n",
|
970 |
+
" texts = [template.format(classname) for template in templates] #format with class\n",
|
971 |
+
" texts = clip.tokenize(texts).cuda() #tokenize\n",
|
972 |
+
" class_embeddings = model.encode_text(texts) #embed with text encoder\n",
|
973 |
+
" class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)\n",
|
974 |
+
" class_embedding = class_embeddings.mean(dim=0)\n",
|
975 |
+
" class_embedding /= class_embedding.norm()\n",
|
976 |
+
" zeroshot_weights.append(class_embedding)\n",
|
977 |
+
" zeroshot_weights = torch.stack(zeroshot_weights, dim=1).cuda()\n",
|
978 |
+
" return zeroshot_weights\n",
|
979 |
+
"\n",
|
980 |
+
"\n",
|
981 |
+
"zeroshot_weights = zeroshot_classifier(imagenet_classes, imagenet_templates)"
|
982 |
+
],
|
983 |
+
"execution_count": 9,
|
984 |
+
"outputs": [
|
985 |
+
{
|
986 |
+
"output_type": "display_data",
|
987 |
+
"data": {
|
988 |
+
"application/vnd.jupyter.widget-view+json": {
|
989 |
+
"model_id": "66a1639713ae441d8a9b873381f9d774",
|
990 |
+
"version_minor": 0,
|
991 |
+
"version_major": 2
|
992 |
+
},
|
993 |
+
"text/plain": [
|
994 |
+
"HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))"
|
995 |
+
]
|
996 |
+
},
|
997 |
+
"metadata": {
|
998 |
+
"tags": []
|
999 |
+
}
|
1000 |
+
},
|
1001 |
+
{
|
1002 |
+
"output_type": "stream",
|
1003 |
+
"text": [
|
1004 |
+
"\n"
|
1005 |
+
],
|
1006 |
+
"name": "stdout"
|
1007 |
+
}
|
1008 |
+
]
|
1009 |
+
},
|
1010 |
+
{
|
1011 |
+
"cell_type": "markdown",
|
1012 |
+
"metadata": {
|
1013 |
+
"id": "1fZo7hG8iJP5"
|
1014 |
+
},
|
1015 |
+
"source": [
|
1016 |
+
"# Zero-shot prediction"
|
1017 |
+
]
|
1018 |
+
},
|
1019 |
+
{
|
1020 |
+
"cell_type": "code",
|
1021 |
+
"metadata": {
|
1022 |
+
"id": "j4kPSZoShQxN"
|
1023 |
+
},
|
1024 |
+
"source": [
|
1025 |
+
"def accuracy(output, target, topk=(1,)):\n",
|
1026 |
+
" pred = output.topk(max(topk), 1, True, True)[1].t()\n",
|
1027 |
+
" correct = pred.eq(target.view(1, -1).expand_as(pred))\n",
|
1028 |
+
" return [float(correct[:k].reshape(-1).float().sum(0, keepdim=True).cpu().numpy()) for k in topk]"
|
1029 |
+
],
|
1030 |
+
"execution_count": 10,
|
1031 |
+
"outputs": []
|
1032 |
+
},
|
1033 |
+
{
|
1034 |
+
"cell_type": "code",
|
1035 |
+
"metadata": {
|
1036 |
+
"colab": {
|
1037 |
+
"base_uri": "https://localhost:8080/",
|
1038 |
+
"height": 102,
|
1039 |
+
"referenced_widgets": [
|
1040 |
+
"84f80a7f3e764346969a347b0f71b24e",
|
1041 |
+
"392656f01b2945f3bd7903783ed8cc96",
|
1042 |
+
"8e47a435519b4ce090879b4be2f61f99",
|
1043 |
+
"41b1ed6b0a9745c1a595377670b15ff4",
|
1044 |
+
"179b8ae1eb7f4a828f953e889b141725",
|
1045 |
+
"d8708e8414fd44f4abd6590c9b57996f",
|
1046 |
+
"800e30f5b4f24475a2b0046da0703631",
|
1047 |
+
"8764308b948745f1a677332fd21fcaf0"
|
1048 |
+
]
|
1049 |
+
},
|
1050 |
+
"id": "wKJ7YsdlkDXo",
|
1051 |
+
"outputId": "ab824854-38e4-4d37-ad40-2a7ce3c5fd43"
|
1052 |
+
},
|
1053 |
+
"source": [
|
1054 |
+
"with torch.no_grad():\n",
|
1055 |
+
" top1, top5, n = 0., 0., 0.\n",
|
1056 |
+
" for i, (images, target) in enumerate(tqdm(loader)):\n",
|
1057 |
+
" images = images.cuda()\n",
|
1058 |
+
" target = target.cuda()\n",
|
1059 |
+
" \n",
|
1060 |
+
" # predict\n",
|
1061 |
+
" image_features = model.encode_image(images)\n",
|
1062 |
+
" image_features /= image_features.norm(dim=-1, keepdim=True)\n",
|
1063 |
+
" logits = 100. * image_features @ zeroshot_weights\n",
|
1064 |
+
"\n",
|
1065 |
+
" # measure accuracy\n",
|
1066 |
+
" acc1, acc5 = accuracy(logits, target, topk=(1, 5))\n",
|
1067 |
+
" top1 += acc1\n",
|
1068 |
+
" top5 += acc5\n",
|
1069 |
+
" n += images.size(0)\n",
|
1070 |
+
"\n",
|
1071 |
+
"top1 = (top1 / n) * 100\n",
|
1072 |
+
"top5 = (top5 / n) * 100 \n",
|
1073 |
+
"\n",
|
1074 |
+
"print(f\"Top-1 accuracy: {top1:.2f}\")\n",
|
1075 |
+
"print(f\"Top-5 accuracy: {top5:.2f}\")"
|
1076 |
+
],
|
1077 |
+
"execution_count": 11,
|
1078 |
+
"outputs": [
|
1079 |
+
{
|
1080 |
+
"output_type": "display_data",
|
1081 |
+
"data": {
|
1082 |
+
"application/vnd.jupyter.widget-view+json": {
|
1083 |
+
"model_id": "84f80a7f3e764346969a347b0f71b24e",
|
1084 |
+
"version_minor": 0,
|
1085 |
+
"version_major": 2
|
1086 |
+
},
|
1087 |
+
"text/plain": [
|
1088 |
+
"HBox(children=(FloatProgress(value=0.0, max=313.0), HTML(value='')))"
|
1089 |
+
]
|
1090 |
+
},
|
1091 |
+
"metadata": {
|
1092 |
+
"tags": []
|
1093 |
+
}
|
1094 |
+
},
|
1095 |
+
{
|
1096 |
+
"output_type": "stream",
|
1097 |
+
"text": [
|
1098 |
+
"\n",
|
1099 |
+
"Top-1 accuracy: 55.93\n",
|
1100 |
+
"Top-5 accuracy: 83.36\n"
|
1101 |
+
],
|
1102 |
+
"name": "stdout"
|
1103 |
+
}
|
1104 |
+
]
|
1105 |
+
}
|
1106 |
+
]
|
1107 |
+
}
|
CLIP/requirements.txt
ADDED
@@ -0,0 +1,5 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
ftfy
|
2 |
+
regex
|
3 |
+
tqdm
|
4 |
+
torch
|
5 |
+
torchvision
|
CLIP/setup.py
ADDED
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
|
3 |
+
import pkg_resources
|
4 |
+
from setuptools import setup, find_packages
|
5 |
+
|
6 |
+
setup(
|
7 |
+
name="clip",
|
8 |
+
py_modules=["clip"],
|
9 |
+
version="1.0",
|
10 |
+
description="",
|
11 |
+
author="OpenAI",
|
12 |
+
packages=find_packages(exclude=["tests*"]),
|
13 |
+
install_requires=[
|
14 |
+
str(r)
|
15 |
+
for r in pkg_resources.parse_requirements(
|
16 |
+
open(os.path.join(os.path.dirname(__file__), "requirements.txt"))
|
17 |
+
)
|
18 |
+
],
|
19 |
+
include_package_data=True,
|
20 |
+
extras_require={'dev': ['pytest']},
|
21 |
+
)
|
CLIP/tests/test_consistency.py
ADDED
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import numpy as np
|
2 |
+
import pytest
|
3 |
+
import torch
|
4 |
+
from PIL import Image
|
5 |
+
|
6 |
+
import clip
|
7 |
+
|
8 |
+
|
9 |
+
@pytest.mark.parametrize('model_name', clip.available_models())
|
10 |
+
def test_consistency(model_name):
|
11 |
+
device = "cpu"
|
12 |
+
jit_model, transform = clip.load(model_name, device=device, jit=True)
|
13 |
+
py_model, _ = clip.load(model_name, device=device, jit=False)
|
14 |
+
|
15 |
+
image = transform(Image.open("CLIP.png")).unsqueeze(0).to(device)
|
16 |
+
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
|
17 |
+
|
18 |
+
with torch.no_grad():
|
19 |
+
logits_per_image, _ = jit_model(image, text)
|
20 |
+
jit_probs = logits_per_image.softmax(dim=-1).cpu().numpy()
|
21 |
+
|
22 |
+
logits_per_image, _ = py_model(image, text)
|
23 |
+
py_probs = logits_per_image.softmax(dim=-1).cpu().numpy()
|
24 |
+
|
25 |
+
assert np.allclose(jit_probs, py_probs, atol=0.01, rtol=0.1)
|
Dockerfile
ADDED
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
FROM --platform=linux/amd64 python:3.9.2-slim
|
2 |
+
WORKDIR /workspace
|
3 |
+
#COPY requirements.txt ./
|
4 |
+
RUN apt-get update && apt-get install -y gcc g++
|
5 |
+
# COPY . /workspace
|
6 |
+
CMD ["bash"]
|
Dockerfile~
ADDED
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
FROM --platform=linux/amd64 python:3.9-slim
|
2 |
+
WORKDIR /workspace
|
3 |
+
#COPY requirements.txt ./
|
4 |
+
RUN apt update && apt install gcc
|
5 |
+
COPY . /workspace
|
6 |
+
CMD ["bash"]
|
LICENSE
ADDED
@@ -0,0 +1,33 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
The Clear BSD License
|
2 |
+
|
3 |
+
Copyright (c) 2024 Yuiga Wada
|
4 |
+
|
5 |
+
All rights reserved.
|
6 |
+
|
7 |
+
Redistribution and use in source and binary forms, with or without
|
8 |
+
modification, are permitted (subject to the limitations in the disclaimer
|
9 |
+
below) provided that the following conditions are met:
|
10 |
+
|
11 |
+
* Redistributions of source code must retain the above copyright notice,
|
12 |
+
this list of conditions and the following disclaimer.
|
13 |
+
|
14 |
+
* Redistributions in binary form must reproduce the above copyright
|
15 |
+
notice, this list of conditions and the following disclaimer in the
|
16 |
+
documentation and/or other materials provided with the distribution.
|
17 |
+
|
18 |
+
* Neither the name of the copyright holder nor the names of its
|
19 |
+
contributors may be used to endorse or promote products derived from this
|
20 |
+
software without specific prior written permission.
|
21 |
+
|
22 |
+
NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
|
23 |
+
THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
|
24 |
+
CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
|
25 |
+
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
|
26 |
+
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
|
27 |
+
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
|
28 |
+
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
|
29 |
+
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
|
30 |
+
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
|
31 |
+
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
|
32 |
+
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
|
33 |
+
POSSIBILITY OF SUCH DAMAGE.
|
README.md
CHANGED
@@ -1,13 +1,11 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
sdk_version: 4.26.0
|
8 |
-
app_file: app.py
|
9 |
-
pinned: false
|
10 |
-
license: bsd-3-clause-clear
|
11 |
-
---
|
12 |
|
13 |
-
|
|
|
|
|
|
|
|
1 |
+
## Get Started on M1 Mac
|
2 |
+
```bash
|
3 |
+
git submodule update --init --recursive
|
4 |
+
docker build . -t polos_demo
|
5 |
+
docker run -it -d -v `pwd`:/workspace -p 8080:8080 --platform linux/amd64 polos_demo
|
6 |
+
docker exec -it $process_id bash
|
|
|
|
|
|
|
|
|
|
|
7 |
|
8 |
+
root@28cb354f7609:~# sh install.sh
|
9 |
+
root@28cb354f7609:~# poetry run python test.py
|
10 |
+
root@28cb354f7609:~# poetry run streamlit run test.py --server.port 8080
|
11 |
+
```
|
app.py
CHANGED
@@ -1,146 +1,38 @@
|
|
1 |
-
import
|
2 |
-
|
3 |
-
import
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
|
17 |
-
|
18 |
-
MAX_SEED = np.iinfo(np.int32).max
|
19 |
-
MAX_IMAGE_SIZE = 1024
|
20 |
-
|
21 |
-
def infer(prompt, negative_prompt, seed, randomize_seed, width, height, guidance_scale, num_inference_steps):
|
22 |
-
|
23 |
-
if randomize_seed:
|
24 |
-
seed = random.randint(0, MAX_SEED)
|
25 |
-
|
26 |
-
generator = torch.Generator().manual_seed(seed)
|
27 |
-
|
28 |
-
image = pipe(
|
29 |
-
prompt = prompt,
|
30 |
-
negative_prompt = negative_prompt,
|
31 |
-
guidance_scale = guidance_scale,
|
32 |
-
num_inference_steps = num_inference_steps,
|
33 |
-
width = width,
|
34 |
-
height = height,
|
35 |
-
generator = generator
|
36 |
-
).images[0]
|
37 |
-
|
38 |
-
return image
|
39 |
-
|
40 |
-
examples = [
|
41 |
-
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
|
42 |
-
"An astronaut riding a green horse",
|
43 |
-
"A delicious ceviche cheesecake slice",
|
44 |
]
|
45 |
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
|
50 |
-
|
51 |
-
|
52 |
-
|
53 |
-
if torch.cuda.is_available():
|
54 |
-
power_device = "GPU"
|
55 |
-
else:
|
56 |
-
power_device = "CPU"
|
57 |
-
|
58 |
-
with gr.Blocks(css=css) as demo:
|
59 |
-
|
60 |
-
with gr.Column(elem_id="col-container"):
|
61 |
-
gr.Markdown(f"""
|
62 |
-
# Text-to-Image Gradio Template
|
63 |
-
Currently running on {power_device}.
|
64 |
-
""")
|
65 |
-
|
66 |
-
with gr.Row():
|
67 |
-
|
68 |
-
prompt = gr.Text(
|
69 |
-
label="Prompt",
|
70 |
-
show_label=False,
|
71 |
-
max_lines=1,
|
72 |
-
placeholder="Enter your prompt",
|
73 |
-
container=False,
|
74 |
-
)
|
75 |
-
|
76 |
-
run_button = gr.Button("Run", scale=0)
|
77 |
-
|
78 |
-
result = gr.Image(label="Result", show_label=False)
|
79 |
|
80 |
-
|
81 |
-
|
82 |
-
negative_prompt = gr.Text(
|
83 |
-
label="Negative prompt",
|
84 |
-
max_lines=1,
|
85 |
-
placeholder="Enter a negative prompt",
|
86 |
-
visible=False,
|
87 |
-
)
|
88 |
-
|
89 |
-
seed = gr.Slider(
|
90 |
-
label="Seed",
|
91 |
-
minimum=0,
|
92 |
-
maximum=MAX_SEED,
|
93 |
-
step=1,
|
94 |
-
value=0,
|
95 |
-
)
|
96 |
-
|
97 |
-
randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
|
98 |
-
|
99 |
-
with gr.Row():
|
100 |
-
|
101 |
-
width = gr.Slider(
|
102 |
-
label="Width",
|
103 |
-
minimum=256,
|
104 |
-
maximum=MAX_IMAGE_SIZE,
|
105 |
-
step=32,
|
106 |
-
value=512,
|
107 |
-
)
|
108 |
-
|
109 |
-
height = gr.Slider(
|
110 |
-
label="Height",
|
111 |
-
minimum=256,
|
112 |
-
maximum=MAX_IMAGE_SIZE,
|
113 |
-
step=32,
|
114 |
-
value=512,
|
115 |
-
)
|
116 |
-
|
117 |
-
with gr.Row():
|
118 |
-
|
119 |
-
guidance_scale = gr.Slider(
|
120 |
-
label="Guidance scale",
|
121 |
-
minimum=0.0,
|
122 |
-
maximum=10.0,
|
123 |
-
step=0.1,
|
124 |
-
value=0.0,
|
125 |
-
)
|
126 |
-
|
127 |
-
num_inference_steps = gr.Slider(
|
128 |
-
label="Number of inference steps",
|
129 |
-
minimum=1,
|
130 |
-
maximum=12,
|
131 |
-
step=1,
|
132 |
-
value=2,
|
133 |
-
)
|
134 |
-
|
135 |
-
gr.Examples(
|
136 |
-
examples = examples,
|
137 |
-
inputs = [prompt]
|
138 |
-
)
|
139 |
|
140 |
-
|
141 |
-
|
142 |
-
inputs = [prompt, negative_prompt, seed, randomize_seed, width, height, guidance_scale, num_inference_steps],
|
143 |
-
outputs = [result]
|
144 |
-
)
|
145 |
|
146 |
-
|
|
|
|
|
|
|
|
|
|
1 |
+
import streamlit as st
|
2 |
+
from PIL import Image
|
3 |
+
from polos.models import download_model, load_checkpoint
|
4 |
+
|
5 |
+
@st.cache(allow_output_mutation=True)
|
6 |
+
def load_model():
|
7 |
+
model_path = download_model("polos")
|
8 |
+
model = load_checkpoint(model_path)
|
9 |
+
return model
|
10 |
+
|
11 |
+
model = load_model()
|
12 |
+
|
13 |
+
default_image = Image.open("test.jpg").convert("RGB")
|
14 |
+
default_refs = [
|
15 |
+
"there is a dog sitting on a couch with a person reaching out",
|
16 |
+
"a dog laying on a couch with a person",
|
17 |
+
'a dog is laying on a couch with a person'
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
18 |
]
|
19 |
|
20 |
+
data = [
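    # One entry per caption to score: 'img' is a PIL image, 'mt' is the candidate caption
    # (filled from the text box below), and 'refs' is a list of reference captions.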
|
21 |
+
{
|
22 |
+
"img": default_image,
|
23 |
+
"mt": "",
|
24 |
+
"refs": default_refs
|
25 |
+
}
|
26 |
+
]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
27 |
|
28 |
+
# Set up the Streamlit interface
|
29 |
+
st.title('Polos Demo')
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
30 |
|
31 |
+
# Text field for user input
|
32 |
+
user_input = st.text_input("Enter the input sentence:", '')
|
|
|
|
|
|
|
33 |
|
34 |
+
# If there is input, use the model to compute the score
|
35 |
+
if user_input:
|
36 |
+
data[0]['mt'] = user_input
|
37 |
+
_, scores = model.predict(data, batch_size=1, cuda=False)
|
38 |
+
st.write("Score:", scores)
|
configs/polos-trainer.yaml
ADDED
@@ -0,0 +1,46 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
seed: 42
|
2 |
+
monitor: pearson
|
3 |
+
metric_mode: max
|
4 |
+
early_stopping: True
|
5 |
+
patience: 1
|
6 |
+
min_delta: 0.0
|
7 |
+
save_top_k: 2
|
8 |
+
save_weights_only: False
|
9 |
+
min_epochs: 1
|
10 |
+
max_epochs: 100
|
11 |
+
gradient_clip_val: 1.0
|
12 |
+
gpus: 1
|
13 |
+
precision: 32
|
14 |
+
|
15 |
+
batch_size: 64
|
16 |
+
accumulate_grad_batches: 4
|
17 |
+
loader_workers: 4
|
18 |
+
|
19 |
+
optimizer: Adam
|
20 |
+
learning_rate: 3.0e-05
|
21 |
+
encoder_learning_rate: 1.0e-05
|
22 |
+
layerwise_decay: 0.95
|
23 |
+
nr_frozen_epochs: 100000
|
24 |
+
scheduler: constant
|
25 |
+
|
26 |
+
train_path: data_en/polaris/polaris_train.csv
|
27 |
+
val_path: data_en/polaris/polaris_val.csv
|
28 |
+
test_path: data_en/polaris/polaris_test.csv
|
29 |
+
train_img_dir_path: data_en/polaris/images
|
30 |
+
val_img_dir_path: data_en/polaris/images
|
31 |
+
test_img_dir_path: data_en/polaris/images
|
32 |
+
|
33 |
+
model: PolosEstimator
|
34 |
+
loss: mse
|
35 |
+
encoder_model: BERT
|
36 |
+
# pretrained_model: princeton-nlp/sup-simcse-roberta-large
|
37 |
+
pretrained_model: princeton-nlp/sup-simcse-roberta-base
|
38 |
+
|
39 |
+
layer: mix
|
40 |
+
scalar_mix_dropout: 0.1
|
41 |
+
pool: cls
|
42 |
+
|
43 |
+
dropout: 0.1
|
44 |
+
activations: Tanh
|
45 |
+
hidden_sizes: "2304,1152"
|
46 |
+
final_activation: False
|
docker.sh
ADDED
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Run the following commands
|
2 |
+
#docker build .
|
3 |
+
#docker run -it -d --platform linux/amd64 c330
|
4 |
+
#docker exec -it 56fec4a536ceee5e76706dde62e4f7d706cf96941a27771b53d7589d882b8ce9 bash
|
install.sh
ADDED
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
1 |
+
git submodule update --init --recursive
|
2 |
+
pip install poetry
|
3 |
+
poetry install
|
4 |
+
#pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
|
pacscore/README.md
ADDED
@@ -0,0 +1,135 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
<div align="center">
|
2 |
+
<h1>PAC-Score: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation (CVPR 2023) </h1>
|
3 |
+
|
4 |
+
|
5 |
+
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a>
|
6 |
+
[![Conference](https://img.shields.io/badge/CVPR-2023(Highlight)-f9f107.svg)](https://openaccess.thecvf.com/content/CVPR2023/html/Sarto_Positive-Augmented_Contrastive_Learning_for_Image_and_Video_Captioning_Evaluation_CVPR_2023_paper.html)
|
7 |
+
[![Paper](https://img.shields.io/badge/Paper-arxiv.2303.12112-B31B1B.svg)](https://arxiv.org/abs/2303.12112)
|
8 |
+
|
9 |
+
</div>
|
10 |
+
|
11 |
+
This repository contains the reference code for the paper [Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation](https://arxiv.org/abs/2303.12112), **CVPR 2023 Highlight✨** (top 2.5% of initial submissions and top 10% of accepted papers).
|
12 |
+
|
13 |
+
Please cite with the following BibTeX:
|
14 |
+
```
|
15 |
+
@inproceedings{sarto2023positive,
|
16 |
+
title={{Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation}},
|
17 |
+
author={Sarto, Sara and Barraco, Manuele and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
|
18 |
+
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
|
19 |
+
year={2023}
|
20 |
+
}
|
21 |
+
```
|
22 |
+
|
23 |
+
<p align="center">
|
24 |
+
<img src="images/model.png" alt="PACS" width="820" />
|
25 |
+
</p>
|
26 |
+
|
27 |
+
Try out the [Web demo](https://ailb-web.ing.unimore.it/pacscore), using [Gradio](https://github.com/gradio-app/gradio).
|
28 |
+
|
29 |
+
## Environment Setup
|
30 |
+
Clone the repository and create the ```pacs``` conda environment using the ```environment.yml``` file:
|
31 |
+
|
32 |
+
|
33 |
+
```
|
34 |
+
conda env create -f environment.yml
|
35 |
+
conda activate pacs
|
36 |
+
```
|
37 |
+
|
38 |
+
## Loading CLIP Models and Data Preparation
|
39 |
+
Checkpoints of different backbones are available at [this link](https://drive.google.com/drive/folders/15Da_nh7CYv8xfryIdETG6dPFSqcBiqpd?usp=sharing).
|
40 |
+
|
41 |
+
Once you have downloaded the checkpoints, place them under the ```checkpoints/``` folder.
|
42 |
+
|
43 |
+
| **Backbone** | **Checkpoint** |
|
44 |
+
| -------------- | ------------- |
|
45 |
+
| **CLIP ViT-B-32** | clip_ViT-B-32.pth |
|
46 |
+
| **OpenCLIP ViT-L-14** | openClip_ViT-L-14.pth |
|
47 |
+
|
48 |
+
An example set of inputs, including a candidates json, an image directory, and a references json, is provided in this repository under ```example/```. The input files are formatted as follows.
|
49 |
+
|
50 |
+
The candidates json should be a dictionary that maps from {"image_identifier": "candidate_captions"}:
|
51 |
+
```
|
52 |
+
{"image1": "A white dog is laying on the ground with its head on its paws .",
|
53 |
+
...}
|
54 |
+
```
|
55 |
+
The image directory should contain the images whose filenames (without extension) act as the keys in the candidates json:
|
56 |
+
```
|
57 |
+
images/
|
58 |
+
├── image1.jpg
|
59 |
+
└── image2.jpg
|
60 |
+
```
|
61 |
+
The references json should be a dictionary that maps from {"image_identifier": ["list", "of", "references"]}:
|
62 |
+
```
|
63 |
+
{"image1":
|
64 |
+
[
|
65 |
+
"A closeup of a white dog that is laying its head on its paws .",
|
66 |
+
"a large white dog lying on the floor .",
|
67 |
+
"A white dog has its head on the ground .",
|
68 |
+
"A white dog is resting its head on a tiled floor with its eyes open .",
|
69 |
+
"A white dog rests its head on the patio bricks ."
|
70 |
+
]}
|
71 |
+
```
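The three inputs only have to agree on the image identifiers (the filenames without extension). As a minimal sketch (an illustration, not part of the repository), the example files can be loaded and checked like this when run from the repository root:

```python
import json
import os

image_dir = "example/images"
with open("example/good_captions.json") as f:
    candidates = json.load(f)    # {"image1": "candidate caption", ...}
with open("example/refs.json") as f:
    references = json.load(f)    # {"image1": ["reference 1", ...], ...}

for fname in sorted(os.listdir(image_dir)):
    key = os.path.splitext(fname)[0]   # "image1.jpg" -> "image1"
    print(key, "|", candidates[key], "|", len(references[key]), "references")
```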
|
72 |
+
## Quick Start: Compute PAC-S
|
73 |
+
|
74 |
+
Run ```python -u compute_metrics.py``` to obtain standard captioning metrics (_e.g._ BLEU, METEOR) and PAC-S.
|
75 |
+
|
76 |
+
To compute RefPAC-S run ```python -u compute_metrics.py --compute_refpac```.
|
77 |
+
|
78 |
+
The default backbone is the CLIP ViT-B-32 model. To use a different backbone (_e.g._ the OpenCLIP ViT-L/14 backbone), pass ```--clip_model open_clip_ViT-L/14``` on the command line.
|
79 |
+
|
80 |
+
```
|
81 |
+
BLEU-1: 0.6400
|
82 |
+
BLEU-4: 0.3503
|
83 |
+
METEOR: 0.3057
|
84 |
+
ROUGE: 0.5012
|
85 |
+
CIDER: 1.4918
|
86 |
+
PAC-S: 0.8264
|
87 |
+
RefPAC-S: 0.8393
|
88 |
+
```
|
89 |
+
Worse captions should get lower scores:
|
90 |
+
|
91 |
+
```
|
92 |
+
python -u compute_metrics.py --candidates_json example/bad_captions.json --compute_refpac
|
93 |
+
|
94 |
+
BLEU-1: 0.4500
|
95 |
+
BLEU-4: 0.0000
|
96 |
+
METEOR: 0.0995
|
97 |
+
ROUGE: 0.3268
|
98 |
+
CIDER: 0.4259
|
99 |
+
PAC-S: 0.5772
|
100 |
+
RefPAC-S: 0.6357
|
101 |
+
|
102 |
+
```
|
103 |
+
## Human Correlation Scores
|
104 |
+
|
105 |
+
#### Flickr8k
|
106 |
+
|
107 |
+
The Flickr8k dataset can be downloaded at [this link](https://drive.google.com/drive/folders/1oQY8zVCmf0ZGUfsJQ_OnqP2_kw1jGIXp?usp=sharing).
|
108 |
+
|
109 |
+
Once you have downloaded the dataset, place it under the ```datasets/flickr8k``` folder.
|
110 |
+
|
111 |
+
|
112 |
+
#### Run Code and Expected Output
|
113 |
+
|
114 |
+
Run ```python -u compute_correlations.py``` to compute correlation scores on **Flickr8k-Expert** and **Flickr8k-CF** datasets.
|
115 |
+
|
116 |
+
|
117 |
+
```
|
118 |
+
Computing correlation scores on dataset: flickr8k_expert
|
119 |
+
BLEU-1 Kendall Tau-b: 32.175 Kendall Tau-c: 32.324
|
120 |
+
BLEU-4 Kendall Tau-b: 30.599 Kendall Tau-c: 30.776
|
121 |
+
METEOR Kendall Tau-b: 41.538 Kendall Tau-c: 41.822
|
122 |
+
ROUGE Kendall Tau-b: 32.139 Kendall Tau-c: 32.314
|
123 |
+
CIDER Kendall Tau-b: 43.602 Kendall Tau-c: 43.891
|
124 |
+
PAC-S Kendall Tau-b: 53.919 Kendall Tau-c: 54.292
|
125 |
+
|
126 |
+
Computing correlation scores on dataset: flickr8k_cf
|
127 |
+
BLEU-1 Kendall Tau-b: 17.946 Kendall Tau-c: 9.256
|
128 |
+
BLEU-4 Kendall Tau-b: 16.863 Kendall Tau-c: 8.710
|
129 |
+
METEOR Kendall Tau-b: 22.269 Kendall Tau-c: 11.510
|
130 |
+
ROUGE Kendall Tau-b: 19.903 Kendall Tau-c: 10.274
|
131 |
+
CIDER Kendall Tau-b: 24.619 Kendall Tau-c: 12.724
|
132 |
+
PAC-S Kendall Tau-b: 36.037 Kendall Tau-c: 18.628
|
133 |
+
```
|
134 |
+
|
135 |
+
For the reference-based version of PAC-S (RefPAC-S), add ```--compute_refpac```.
|
pacscore/compute_correlations.py
ADDED
@@ -0,0 +1,111 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import argparse
|
2 |
+
import torch
|
3 |
+
import evaluation
|
4 |
+
import scipy.stats
|
5 |
+
|
6 |
+
from models.clip import clip
|
7 |
+
from utils import collate_fn
|
8 |
+
from evaluation import PACScore, RefPACScore
|
9 |
+
from models import open_clip
|
10 |
+
from data import Flickr8k
|
11 |
+
from torch.utils.data import DataLoader
|
12 |
+
|
13 |
+
|
14 |
+
_MODELS = {
|
15 |
+
"ViT-B/32": "checkpoints/clip_ViT-B-32.pth",
|
16 |
+
"open_clip_ViT-L/14": "checkpoints/openClip_ViT-L-14.pth"
|
17 |
+
}
|
18 |
+
|
19 |
+
def compute_correlation_scores(dataloader, model, preprocess, args):
|
20 |
+
gen = {}
|
21 |
+
gts = {}
|
22 |
+
|
23 |
+
human_scores = list()
|
24 |
+
ims_cs = list()
|
25 |
+
gen_cs = list()
|
26 |
+
gts_cs = list()
|
27 |
+
all_scores = dict()
|
28 |
+
model.eval()
|
29 |
+
|
30 |
+
for it, (images, candidates, references, scores) in enumerate(iter(dataloader)):
|
31 |
+
for i, (im_i, gts_i, gen_i, score_i) in enumerate(zip(images, references, candidates, scores)):
|
32 |
+
gen['%d_%d' % (it, i)] = [gen_i, ]
|
33 |
+
gts['%d_%d' % (it, i)] = gts_i
|
34 |
+
|
35 |
+
ims_cs.append(im_i)
|
36 |
+
gen_cs.append(gen_i)
|
37 |
+
gts_cs.append(gts_i)
|
38 |
+
human_scores.append(score_i)
|
39 |
+
|
40 |
+
gts = evaluation.PTBTokenizer.tokenize(gts)
|
41 |
+
gen = evaluation.PTBTokenizer.tokenize(gen)
|
42 |
+
all_scores_metrics = evaluation.get_all_metrics(gts, gen, return_per_cap=True)
|
43 |
+
|
44 |
+
for k, v in all_scores_metrics.items():
|
45 |
+
if k == 'BLEU':
|
46 |
+
all_scores['BLEU-1'] = v[0]
|
47 |
+
all_scores['BLEU-4'] = v[-1]
|
48 |
+
else:
|
49 |
+
all_scores[k] = v
|
50 |
+
|
51 |
+
# PAC-S
|
52 |
+
_, pac_scores, candidate_feats, len_candidates = PACScore(model, preprocess, ims_cs, gen_cs, device, w=2.0)
|
53 |
+
all_scores['PAC-S'] = pac_scores
|
54 |
+
|
55 |
+
# RefPAC-S
|
56 |
+
if args.compute_refpac:
|
57 |
+
_, per_instance_text_text = RefPACScore(model, gts_cs, candidate_feats, device, torch.tensor(len_candidates))
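        # RefPAC-S: harmonic mean of the image-candidate score (PAC-S) and the candidate-reference similarity.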
|
58 |
+
refpac_scores = 2 * pac_scores * per_instance_text_text / (pac_scores + per_instance_text_text)
|
59 |
+
all_scores['RefPAC-S'] = refpac_scores
|
60 |
+
|
61 |
+
for k, v in all_scores.items():
|
62 |
+
kendalltau_b = 100 * scipy.stats.kendalltau(v, human_scores, variant='b')[0]
|
63 |
+
kendalltau_c = 100 * scipy.stats.kendalltau(v, human_scores, variant='c')[0]
|
64 |
+
print('%s \t Kendall Tau-b: %.3f \t Kendall Tau-c: %.3f'
|
65 |
+
% (k, kendalltau_b, kendalltau_c))
|
66 |
+
|
67 |
+
|
68 |
+
def compute_scores(model, preprocess, args):
|
69 |
+
args.datasets = ['flickr8k_expert', 'flickr8k_cf']
|
70 |
+
|
71 |
+
args.batch_size_compute_score = 10
|
72 |
+
for d in args.datasets:
|
73 |
+
print("Computing correlation scores on dataset: " + d)
|
74 |
+
if d == 'flickr8k_expert':
|
75 |
+
dataset = Flickr8k(json_file='flickr8k.json')
|
76 |
+
dataloader = DataLoader(dataset, batch_size=args.batch_size_compute_score, shuffle=False, collate_fn=collate_fn)
|
77 |
+
elif d == 'flickr8k_cf':
|
78 |
+
dataset = Flickr8k(json_file='crowdflower_flickr8k.json')
|
79 |
+
dataloader = DataLoader(dataset, batch_size=args.batch_size_compute_score, shuffle=False, collate_fn=collate_fn)
|
80 |
+
|
81 |
+
compute_correlation_scores(dataloader, model, preprocess, args)
|
82 |
+
|
83 |
+
|
84 |
+
|
85 |
+
if __name__ == '__main__':
|
86 |
+
# Argument parsing
|
87 |
+
parser = argparse.ArgumentParser(description='PAC-S evaluation')
|
88 |
+
parser.add_argument('--clip_model', type=str, default='ViT-B/32',
|
89 |
+
choices=['ViT-B/32', 'open_clip_ViT-L/14'])
|
90 |
+
parser.add_argument('--compute_refpac', action='store_true')
|
91 |
+
|
92 |
+
args = parser.parse_args()
|
93 |
+
|
94 |
+
device = "cuda" if torch.cuda.is_available() else "cpu"
|
95 |
+
|
96 |
+
if args.clip_model.startswith('open_clip'):
|
97 |
+
print("Using Open CLIP Model: " + args.clip_model)
|
98 |
+
model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
|
99 |
+
else:
|
100 |
+
print("Using CLIP Model: " + args.clip_model)
|
101 |
+
model, preprocess = clip.load(args.clip_model, device=device)
|
102 |
+
|
103 |
+
model = model.to(device)
|
104 |
+
model = model.float()
|
105 |
+
|
106 |
+
checkpoint = torch.load(_MODELS[args.clip_model])
|
107 |
+
model.load_state_dict(checkpoint['state_dict'])
|
108 |
+
model.eval()
|
109 |
+
|
110 |
+
compute_scores(model, preprocess, args)
|
111 |
+
|
pacscore/compute_metrics.py
ADDED
@@ -0,0 +1,110 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import argparse
|
3 |
+
import torch
|
4 |
+
import json
|
5 |
+
import evaluation
|
6 |
+
import numpy as np
|
7 |
+
|
8 |
+
from models.clip import clip
|
9 |
+
from evaluation import PACScore, RefPACScore
|
10 |
+
from models import open_clip
|
11 |
+
|
12 |
+
_MODELS = {
|
13 |
+
"ViT-B/32": "checkpoints/clip_ViT-B-32.pth",
|
14 |
+
"open_clip_ViT-L/14": "checkpoints/openClip_ViT-L-14.pth"
|
15 |
+
}
|
16 |
+
|
17 |
+
|
18 |
+
def compute_scores(model, preprocess, image_ids, candidates, references, args):
|
19 |
+
gen = {}
|
20 |
+
gts = {}
|
21 |
+
|
22 |
+
ims_cs = list()
|
23 |
+
gen_cs = list()
|
24 |
+
gts_cs = list()
|
25 |
+
all_scores = dict()
|
26 |
+
model.eval()
|
27 |
+
|
28 |
+
for i, (im_i, gts_i, gen_i) in enumerate(zip(image_ids, references, candidates)):
|
29 |
+
gen['%d' % (i)] = [gen_i, ]
|
30 |
+
gts['%d' % (i)] = gts_i
|
31 |
+
|
32 |
+
ims_cs.append(im_i)
|
33 |
+
gen_cs.append(gen_i)
|
34 |
+
gts_cs.append(gts_i)
|
35 |
+
|
36 |
+
gts = evaluation.PTBTokenizer.tokenize(gts)
|
37 |
+
gen = evaluation.PTBTokenizer.tokenize(gen)
|
38 |
+
|
39 |
+
all_scores_metrics = evaluation.get_all_metrics(gts, gen)
|
40 |
+
|
41 |
+
for k, v in all_scores_metrics.items():
|
42 |
+
if k == 'BLEU':
|
43 |
+
all_scores['BLEU-1'] = v[0]
|
44 |
+
all_scores['BLEU-4'] = v[-1]
|
45 |
+
else:
|
46 |
+
all_scores[k] = v
|
47 |
+
|
48 |
+
# PAC-S
|
49 |
+
_, pac_scores, candidate_feats, len_candidates = PACScore(
|
50 |
+
model, preprocess, ims_cs, gen_cs, device, w=2.0)
|
51 |
+
all_scores['PAC-S'] = np.mean(pac_scores)
|
52 |
+
|
53 |
+
# RefPAC-S
|
54 |
+
if args.compute_refpac:
|
55 |
+
_, per_instance_text_text = RefPACScore(
|
56 |
+
model, gts_cs, candidate_feats, device, torch.tensor(len_candidates))
|
57 |
+
refpac_scores = 2 * pac_scores * per_instance_text_text / \
|
58 |
+
(pac_scores + per_instance_text_text)
|
59 |
+
all_scores['RefPAC-S'] = np.mean(refpac_scores)
|
60 |
+
|
61 |
+
return all_scores
|
62 |
+
|
63 |
+
|
64 |
+
if __name__ == '__main__':
|
65 |
+
# Argument parsing
|
66 |
+
parser = argparse.ArgumentParser(description='PAC-S evaluation')
|
67 |
+
parser.add_argument('--clip_model', type=str, default='ViT-B/32',
|
68 |
+
choices=['ViT-B/32', 'open_clip_ViT-L/14'])
|
69 |
+
parser.add_argument('--image_dir', type=str, default='example/images')
|
70 |
+
parser.add_argument('--candidates_json', type=str,
|
71 |
+
default='example/good_captions.json')
|
72 |
+
parser.add_argument('--references_json', type=str, default='example/refs.json')
|
73 |
+
parser.add_argument('--compute_refpac', action='store_true')
|
74 |
+
|
75 |
+
args = parser.parse_args()
|
76 |
+
|
77 |
+
device = "cuda" if torch.cuda.is_available() else "cpu"
|
78 |
+
|
79 |
+
image_ids = [img_id for img_id in os.listdir(args.image_dir)]
|
80 |
+
|
81 |
+
with open(args.candidates_json) as f:
|
82 |
+
candidates = json.load(f)
|
83 |
+
candidates = [candidates[cid.split('.')[0]] for cid in image_ids]
|
84 |
+
|
85 |
+
with open(args.references_json) as f:
|
86 |
+
references = json.load(f)
|
87 |
+
references = [references[cid.split('.')[0]] for cid in image_ids]
|
88 |
+
|
89 |
+
image_ids = [os.path.join(args.image_dir, img_id) for img_id in image_ids]
|
90 |
+
|
91 |
+
if args.clip_model.startswith('open_clip'):
|
92 |
+
print("Using Open CLIP Model: " + args.clip_model)
|
93 |
+
model, _, preprocess = open_clip.create_model_and_transforms(
|
94 |
+
'ViT-L-14', pretrained='laion2b_s32b_b82k')
|
95 |
+
else:
|
96 |
+
print("Using CLIP Model: " + args.clip_model)
|
97 |
+
model, preprocess = clip.load(args.clip_model, device=device)
|
98 |
+
|
99 |
+
model = model.to(device)
|
100 |
+
model = model.float()
|
101 |
+
|
102 |
+
checkpoint = torch.load(_MODELS[args.clip_model])
|
103 |
+
model.load_state_dict(checkpoint['state_dict'])
|
104 |
+
model.eval()
|
105 |
+
|
106 |
+
all_scores = compute_scores(
|
107 |
+
model, preprocess, image_ids, candidates, references, args)
|
108 |
+
|
109 |
+
for k, v in all_scores.items():
|
110 |
+
print('%s: %.4f' % (k, v))
|
pacscore/data/__init__.py
ADDED
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
from .dataset import *
|
2 |
+
from torch.utils.data import DataLoader as TorchDataLoader
|
3 |
+
|
4 |
+
class DataLoader(TorchDataLoader):
|
5 |
+
def __init__(self, dataset, *args, **kwargs):
|
6 |
+
super(DataLoader, self).__init__(dataset, *args, collate_fn=dataset.collate_fn(), **kwargs)
|
pacscore/data/dataset.py
ADDED
@@ -0,0 +1,55 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import torch
|
2 |
+
import os
|
3 |
+
from PIL import Image
|
4 |
+
import json
|
5 |
+
import numpy as np
|
6 |
+
import torch
|
7 |
+
|
8 |
+
class Flickr8k(torch.utils.data.Dataset):
|
9 |
+
def __init__(self, json_file, root='datasets/flickr8k/',
|
10 |
+
transform=None, load_images=False):
|
11 |
+
self.im_folder = os.path.join(root, 'images')
|
12 |
+
self.transform = transform
|
13 |
+
self.load_images = load_images
|
14 |
+
|
15 |
+
with open(os.path.join(root, json_file)) as fp:
|
16 |
+
data = json.load(fp)
|
17 |
+
|
18 |
+
self.data = list()
|
19 |
+
for i in data:
|
20 |
+
for human_judgement in data[i]['human_judgement']:
|
21 |
+
if np.isnan(human_judgement['rating']):
|
22 |
+
print('NaN')
|
23 |
+
continue
|
24 |
+
d = {
|
25 |
+
'image': data[i]['image_path'].split('/')[-1],
|
26 |
+
'references': [' '.join(gt.split()) for gt in data[i]['ground_truth']],
|
27 |
+
'candidate': ' '.join(human_judgement['caption'].split()),
|
28 |
+
'human_score': human_judgement['rating']
|
29 |
+
}
|
30 |
+
self.data.append(d)
|
31 |
+
|
32 |
+
def get_image(self, filename):
|
33 |
+
img = Image.open(os.path.join(self.im_folder, filename)).convert('RGB')
|
34 |
+
if self.transform:
|
35 |
+
img = self.transform(img)
|
36 |
+
return img
|
37 |
+
|
38 |
+
def __len__(self):
|
39 |
+
return len(self.data)
|
40 |
+
|
41 |
+
def __getitem__(self, idx):
|
42 |
+
im_idx = self.data[idx]['image']
|
43 |
+
candidate = self.data[idx]['candidate']
|
44 |
+
references = self.data[idx]['references']
|
45 |
+
score = self.data[idx]['human_score']
|
46 |
+
|
47 |
+
if self.load_images:
|
48 |
+
im = self.get_image(im_idx)
|
49 |
+
else:
|
50 |
+
im = os.path.join(self.im_folder, im_idx)
|
51 |
+
|
52 |
+
return im, candidate, references, score
|
53 |
+
|
54 |
+
|
55 |
+
|
pacscore/data/tokenizer/__init__.py
ADDED
File without changes
|
pacscore/data/tokenizer/bpe_simple_vocab_16e6.txt.gz
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
|
3 |
+
size 1356917
|
pacscore/data/tokenizer/simple_tokenizer.py
ADDED
@@ -0,0 +1,144 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import gzip
|
2 |
+
import html
|
3 |
+
import os
|
4 |
+
from functools import lru_cache
|
5 |
+
|
6 |
+
import ftfy
|
7 |
+
import regex as re
|
8 |
+
|
9 |
+
|
10 |
+
@lru_cache()
|
11 |
+
def default_bpe():
|
12 |
+
return os.path.join(os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz")
|
13 |
+
|
14 |
+
|
15 |
+
@lru_cache()
|
16 |
+
def bytes_to_unicode():
|
17 |
+
"""
|
18 |
+
Returns list of utf-8 byte and a corresponding list of unicode strings.
|
19 |
+
The reversible bpe codes work on unicode strings.
|
20 |
+
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
|
21 |
+
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
|
22 |
+
This is a significant percentage of your normal, say, 32K bpe vocab.
|
23 |
+
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
|
24 |
+
And avoids mapping to whitespace/control characters the bpe code barfs on.
|
25 |
+
"""
|
26 |
+
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
|
27 |
+
cs = bs[:]
|
28 |
+
n = 0
|
29 |
+
for b in range(2**8):
|
30 |
+
if b not in bs:
|
31 |
+
bs.append(b)
|
32 |
+
cs.append(2**8+n)
|
33 |
+
n += 1
|
34 |
+
cs = [chr(n) for n in cs]
|
35 |
+
return dict(zip(bs, cs))
|
36 |
+
|
37 |
+
|
38 |
+
def get_pairs(word):
|
39 |
+
"""Return set of symbol pairs in a word.
|
40 |
+
Word is represented as tuple of symbols (symbols being variable-length strings).
|
41 |
+
"""
|
42 |
+
pairs = set()
|
43 |
+
prev_char = word[0]
|
44 |
+
for char in word[1:]:
|
45 |
+
pairs.add((prev_char, char))
|
46 |
+
prev_char = char
|
47 |
+
return pairs
|
48 |
+
|
49 |
+
|
50 |
+
def basic_clean(text):
|
51 |
+
text = ftfy.fix_text(text)
|
52 |
+
text = html.unescape(html.unescape(text))
|
53 |
+
return text.strip()
|
54 |
+
|
55 |
+
|
56 |
+
def whitespace_clean(text):
|
57 |
+
text = re.sub(r'\s+', ' ', text)
|
58 |
+
text = text.strip()
|
59 |
+
return text
|
60 |
+
|
61 |
+
|
62 |
+
class SimpleTokenizer(object):
|
63 |
+
def __init__(self, bpe_path: str = default_bpe()):
|
64 |
+
self.byte_encoder = bytes_to_unicode()
|
65 |
+
self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
|
66 |
+
merges = gzip.open(bpe_path).read().decode("utf-8").split('\n')
|
67 |
+
merges = merges[1:49152-256-2+1]
|
68 |
+
merges = [tuple(merge.split()) for merge in merges]
|
69 |
+
vocab = list(bytes_to_unicode().values())
|
70 |
+
vocab = vocab + [v+'</w>' for v in vocab]
|
71 |
+
for merge in merges:
|
72 |
+
vocab.append(''.join(merge))
|
73 |
+
vocab.extend(['<|startoftext|>', '<|endoftext|>'])
|
74 |
+
self.encoder = dict(zip(vocab, range(len(vocab))))
|
75 |
+
self.decoder = {v: k for k, v in self.encoder.items()}
|
76 |
+
self.bpe_ranks = dict(zip(merges, range(len(merges))))
|
77 |
+
self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
|
78 |
+
self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE)
|
79 |
+
|
80 |
+
@property
|
81 |
+
def vocab_size(self):
|
82 |
+
return len(self.encoder)
|
83 |
+
|
84 |
+
@property
|
85 |
+
def eos_idx(self):
|
86 |
+
return self.encoder['<|endoftext|>']
|
87 |
+
|
88 |
+
@property
|
89 |
+
def bos_idx(self):
|
90 |
+
return self.encoder['<|startoftext|>']
|
91 |
+
|
92 |
+
def bpe(self, token):
|
93 |
+
if token in self.cache:
|
94 |
+
return self.cache[token]
|
95 |
+
word = tuple(token[:-1]) + ( token[-1] + '</w>',)
|
96 |
+
pairs = get_pairs(word)
|
97 |
+
|
98 |
+
if not pairs:
|
99 |
+
return token+'</w>'
|
100 |
+
|
101 |
+
while True:
|
102 |
+
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
|
103 |
+
if bigram not in self.bpe_ranks:
|
104 |
+
break
|
105 |
+
first, second = bigram
|
106 |
+
new_word = []
|
107 |
+
i = 0
|
108 |
+
while i < len(word):
|
109 |
+
try:
|
110 |
+
j = word.index(first, i)
|
111 |
+
new_word.extend(word[i:j])
|
112 |
+
i = j
|
113 |
+
except:
|
114 |
+
new_word.extend(word[i:])
|
115 |
+
break
|
116 |
+
|
117 |
+
if word[i] == first and i < len(word)-1 and word[i+1] == second:
|
118 |
+
new_word.append(first+second)
|
119 |
+
i += 2
|
120 |
+
else:
|
121 |
+
new_word.append(word[i])
|
122 |
+
i += 1
|
123 |
+
new_word = tuple(new_word)
|
124 |
+
word = new_word
|
125 |
+
if len(word) == 1:
|
126 |
+
break
|
127 |
+
else:
|
128 |
+
pairs = get_pairs(word)
|
129 |
+
word = ' '.join(word)
|
130 |
+
self.cache[token] = word
|
131 |
+
return word
|
132 |
+
|
133 |
+
def encode(self, text):
|
134 |
+
bpe_tokens = []
|
135 |
+
text = whitespace_clean(basic_clean(text)).lower()
|
136 |
+
for token in re.findall(self.pat, text):
|
137 |
+
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
|
138 |
+
bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
|
139 |
+
return bpe_tokens
|
140 |
+
|
141 |
+
def decode(self, tokens):
|
142 |
+
text = ''.join([self.decoder[token] for token in tokens])
|
143 |
+
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors="replace").replace('</w>', ' ')
|
144 |
+
return text
|
pacscore/environment.yml
ADDED
@@ -0,0 +1,92 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
name: pacs
|
2 |
+
channels:
|
3 |
+
- defaults
|
4 |
+
dependencies:
|
5 |
+
- _libgcc_mutex=0.1=main
|
6 |
+
- _openmp_mutex=5.1=1_gnu
|
7 |
+
- blas=1.0=mkl
|
8 |
+
- brotlipy=0.7.0=py39h27cfd23_1003
|
9 |
+
- ca-certificates=2023.01.10=h06a4308_0
|
10 |
+
- certifi=2022.12.7=py39h06a4308_0
|
11 |
+
- cffi=1.15.1=py39h5eee18b_3
|
12 |
+
- charset-normalizer=2.0.4=pyhd3eb1b0_0
|
13 |
+
- cryptography=38.0.4=py39h9ce1e76_0
|
14 |
+
- cudatoolkit=11.3.1=h2bc3f7f_2
|
15 |
+
- cudnn=8.2.1=cuda11.3_0
|
16 |
+
- cupti=11.3.1=0
|
17 |
+
- flit-core=3.6.0=pyhd3eb1b0_0
|
18 |
+
- freetype=2.12.1=h4a9f257_0
|
19 |
+
- future=0.18.2=py39h06a4308_1
|
20 |
+
- giflib=5.2.1=h5eee18b_1
|
21 |
+
- idna=3.4=py39h06a4308_0
|
22 |
+
- intel-openmp=2021.4.0=h06a4308_3561
|
23 |
+
- jpeg=9e=h7f8727e_0
|
24 |
+
- lcms2=2.12=h3be6417_0
|
25 |
+
- ld_impl_linux-64=2.38=h1181459_1
|
26 |
+
- lerc=3.0=h295c915_0
|
27 |
+
- libdeflate=1.8=h7f8727e_5
|
28 |
+
- libffi=3.4.2=h6a678d5_6
|
29 |
+
- libgcc-ng=11.2.0=h1234567_1
|
30 |
+
- libgomp=11.2.0=h1234567_1
|
31 |
+
- libpng=1.6.37=hbc83047_0
|
32 |
+
- libprotobuf=3.20.1=h4ff587b_0
|
33 |
+
- libstdcxx-ng=11.2.0=h1234567_1
|
34 |
+
- libtiff=4.5.0=hecacb30_0
|
35 |
+
- libwebp=1.2.4=h11a3e52_0
|
36 |
+
- libwebp-base=1.2.4=h5eee18b_0
|
37 |
+
- lz4-c=1.9.4=h6a678d5_0
|
38 |
+
- magma=2.7.0=h8db6258_0
|
39 |
+
- mkl=2021.4.0=h06a4308_640
|
40 |
+
- mkl-service=2.4.0=py39h7f8727e_0
|
41 |
+
- mkl_fft=1.3.1=py39hd3c417c_0
|
42 |
+
- mkl_random=1.2.2=py39h51133e4_0
|
43 |
+
- ncurses=6.4=h6a678d5_0
|
44 |
+
- ninja=1.10.2=h06a4308_5
|
45 |
+
- ninja-base=1.10.2=hd09550d_5
|
46 |
+
- numpy=1.23.5=py39h14f4228_0
|
47 |
+
- numpy-base=1.23.5=py39h31eccc5_0
|
48 |
+
- openssl=1.1.1s=h7f8727e_0
|
49 |
+
- pillow=9.3.0=py39hace64e9_1
|
50 |
+
- pip=22.3.1=py39h06a4308_0
|
51 |
+
- pycparser=2.21=pyhd3eb1b0_0
|
52 |
+
- pyopenssl=22.0.0=pyhd3eb1b0_0
|
53 |
+
- pysocks=1.7.1=py39h06a4308_0
|
54 |
+
- python=3.9.16=h7a1cb2a_0
|
55 |
+
- pytorch=1.12.1=gpu_cuda113py39h19ae3d8_1
|
56 |
+
- pyyaml=6.0=py39h5eee18b_1
|
57 |
+
- readline=8.2=h5eee18b_0
|
58 |
+
- requests=2.28.1=py39h06a4308_0
|
59 |
+
- setuptools=65.6.3=py39h06a4308_0
|
60 |
+
- six=1.16.0=pyhd3eb1b0_1
|
61 |
+
- sqlite=3.40.1=h5082296_0
|
62 |
+
- tk=8.6.12=h1ccaba5_0
|
63 |
+
- torchvision=0.13.1=cpu_py39h164cc8f_0
|
64 |
+
- typing-extensions=4.4.0=py39h06a4308_0
|
65 |
+
- typing_extensions=4.4.0=py39h06a4308_0
|
66 |
+
- tzdata=2022g=h04d1e81_0
|
67 |
+
- urllib3=1.26.14=py39h06a4308_0
|
68 |
+
- wheel=0.37.1=pyhd3eb1b0_0
|
69 |
+
- xz=5.2.10=h5eee18b_1
|
70 |
+
- yaml=0.2.5=h7b6447c_0
|
71 |
+
- zlib=1.2.13=h5eee18b_0
|
72 |
+
- zstd=1.5.2=ha4553b6_0
|
73 |
+
- pip:
|
74 |
+
- contourpy==1.0.7
|
75 |
+
- cycler==0.11.0
|
76 |
+
- filelock==3.9.0
|
77 |
+
- fonttools==4.38.0
|
78 |
+
- ftfy==6.1.1
|
79 |
+
- huggingface-hub==0.12.0
|
80 |
+
- kiwisolver==1.4.4
|
81 |
+
- matplotlib==3.6.3
|
82 |
+
- packaging==23.0
|
83 |
+
- psutil==5.9.4
|
84 |
+
- pycocoevalcap==1.2
|
85 |
+
- pycocotools==2.0.6
|
86 |
+
- pyparsing==3.0.9
|
87 |
+
- python-dateutil==2.8.2
|
88 |
+
- python-pidfile==3.0.0
|
89 |
+
- regex==2022.10.31
|
90 |
+
- scipy==1.10.0
|
91 |
+
- tqdm==4.64.1
|
92 |
+
- wcwidth==0.2.6
|
pacscore/evaluation/__init__.py
ADDED
@@ -0,0 +1,44 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
'''
|
2 |
+
Automatic generation evaluation metrics wrapper
|
3 |
+
The most useful function here is
|
4 |
+
get_all_metrics(refs, cands)
|
5 |
+
'''
|
6 |
+
from .pac_score import PACScore, RefPACScore
|
7 |
+
from .tokenizer import PTBTokenizer
|
8 |
+
from pycocoevalcap.meteor.meteor import Meteor
|
9 |
+
from pycocoevalcap.bleu.bleu import Bleu
|
10 |
+
from pycocoevalcap.cider.cider import Cider
|
11 |
+
from pycocoevalcap.rouge.rouge import Rouge
|
12 |
+
from pycocoevalcap.spice.spice import Spice
|
13 |
+
|
14 |
+
def get_all_metrics(refs, cands, return_per_cap=False):
|
15 |
+
metrics = []
|
16 |
+
names = []
|
17 |
+
|
18 |
+
pycoco_eval_cap_scorers = [(Bleu(4), 'BLEU'),
|
19 |
+
(Meteor(), 'METEOR'),
|
20 |
+
(Rouge(), 'ROUGE'),
|
21 |
+
(Cider(), 'CIDER'),
|
22 |
+
# (Spice(), 'SPICE')
|
23 |
+
]
|
24 |
+
|
25 |
+
for scorer, name in pycoco_eval_cap_scorers:
|
26 |
+
overall, per_cap = pycoco_eval(scorer, refs, cands)
|
27 |
+
if return_per_cap:
|
28 |
+
metrics.append(per_cap)
|
29 |
+
else:
|
30 |
+
metrics.append(overall)
|
31 |
+
names.append(name)
|
32 |
+
|
33 |
+
metrics = dict(zip(names, metrics))
|
34 |
+
return metrics
|
35 |
+
|
36 |
+
|
37 |
+
def pycoco_eval(scorer, refs, cands):
|
38 |
+
'''
|
39 |
+
scorer is assumed to have a compute_score function.
|
40 |
+
refs is a list of lists of strings
|
41 |
+
cands is a list of predictions
|
42 |
+
'''
|
43 |
+
average_score, scores = scorer.compute_score(refs, cands)
|
44 |
+
return average_score, scores
|
pacscore/evaluation/pac_score/__init__.py
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
from .pac_score import PACScore, RefPACScore
|
pacscore/evaluation/pac_score/pac_score.py
ADDED
@@ -0,0 +1,133 @@
+from PIL import Image
+from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
+import torch
+import tqdm
+import numpy as np
+import collections
+from models import clip
+
+
+class CapDataset(torch.utils.data.Dataset):
+    def __init__(self, data, prefix='A photo depicts'):
+        self.data = data
+        self.prefix = prefix
+        if self.prefix[-1] != ' ':
+            self.prefix += ' '
+
+    def __getitem__(self, idx):
+        c_data = self.data[idx]
+        c_data = clip.tokenize(self.prefix + c_data, truncate=True).squeeze()
+        return {'caption': c_data}
+
+    def __len__(self):
+        return len(self.data)
+
+
+class ImageDataset(torch.utils.data.Dataset):
+    def __init__(self, data, transform=None):
+        self.data = data
+        if transform:
+            self.preprocess = transform
+        else:
+            self.preprocess = self._transform_test(224)
+
+    def _transform_test(self, n_px):
+        return Compose([
+            Resize(n_px, interpolation=Image.BICUBIC),
+            CenterCrop(n_px),
+            lambda image: image.convert("RGB"),
+            ToTensor(),
+            Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
+        ])
+
+    def __getitem__(self, idx):
+        c_data = self.data[idx]
+        image = Image.open(c_data)
+        image = self.preprocess(image)
+        return {'image': image}
+
+    def __len__(self):
+        return len(self.data)
+
+
+def extract_all_captions(captions, model, device, batch_size=256, num_workers=6):
+    data = torch.utils.data.DataLoader(CapDataset(captions), batch_size=batch_size, num_workers=num_workers,
+                                       shuffle=False)
+    all_text_features = []
+    with torch.no_grad():
+        for b in tqdm.tqdm(data):
+            b = b['caption'].to(device)
+            all_text_features.append(model.encode_text(b).detach().cpu().numpy())
+    all_text_features = np.vstack(all_text_features)
+    return all_text_features
+
+
+def extract_all_images(images, model, transform, device, batch_size=64, num_workers=6):
+    data = torch.utils.data.DataLoader(ImageDataset(images, transform), batch_size=batch_size, num_workers=num_workers,
+                                       shuffle=False)
+    all_image_features = []
+    with torch.no_grad():
+        for b in tqdm.tqdm(data):
+            b = b['image'].to(device)
+            all_image_features.append(model.encode_image(b).detach().cpu().numpy())
+    all_image_features = np.vstack(all_image_features)
+    return all_image_features
+
+
+def PACScore(model, transform, images, candidates, device, w=2.0):
+    '''
+    compute the unreferenced PAC score.
+    '''
+    len_candidates = [len(c.split()) for c in candidates]
+    if isinstance(images, list):
+        # extracting image features
+        images = extract_all_images(images, model, transform, device)
+
+    candidates = extract_all_captions(candidates, model, device)
+
+    images = images / np.sqrt(np.sum(images ** 2, axis=1, keepdims=True))
+    candidates = candidates / np.sqrt(np.sum(candidates ** 2, axis=1, keepdims=True))
+
+    per = w * np.clip(np.sum(images * candidates, axis=1), 0, None)
+    return np.mean(per), per, candidates, len_candidates
+
+
+def RefPACScore(model, references, candidates, device, len_candidates):
+    '''
+    compute the RefPAC score, extracting only the reference captions.
+    '''
+    if isinstance(candidates, list):
+        candidates = extract_all_captions(candidates, model, device)
+
+    len_references = []
+    flattened_refs = []
+    flattened_refs_idxs = []
+    for idx, refs in enumerate(references):
+        len_r = [len(r.split()) for r in refs]
+        len_references.append(len_r)
+        flattened_refs.extend(refs)
+        flattened_refs_idxs.extend([idx for _ in refs])
+
+    flattened_refs = extract_all_captions(flattened_refs, model, device)
+
+    candidates = candidates / np.sqrt(np.sum(candidates ** 2, axis=1, keepdims=True))
+    flattened_refs = flattened_refs / np.sqrt(np.sum(flattened_refs ** 2, axis=1, keepdims=True))
+
+    cand_idx2refs = collections.defaultdict(list)
+    for ref_feats, cand_idx in zip(flattened_refs, flattened_refs_idxs):
+        cand_idx2refs[cand_idx].append(ref_feats)
+
+    assert len(cand_idx2refs) == len(candidates)
+
+    cand_idx2refs = {k: np.vstack(v) for k, v in cand_idx2refs.items()}
+
+    per = []
+    for c_idx, (cand, l_ref, l_cand) in enumerate(zip(candidates, len_references, len_candidates)):
+        cur_refs = cand_idx2refs[c_idx]
+        all_sims = cand.dot(cur_refs.transpose())
+
+        per.append(np.max(all_sims))
+
+    return np.mean(per), per
+
+
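As a reading aid (not part of the commit), here is a hedged sketch of how `PACScore` and `RefPACScore` can be called. The checkpoint handling is simplified to a plain CLIP ViT-B/32 load; the repository's own scripts load a fine-tuned checkpoint, and the `models`/`evaluation` import paths assume the working directory is pacscore/. Image paths and captions are taken from the example files added later in this commit.

# Illustrative only: paths, captions, and the plain clip.load() call are assumptions.
import torch
from models import clip
from evaluation import PACScore, RefPACScore

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)
model.eval()

images = ['example/images/image1.jpg', 'example/images/image2.jpg']
candidates = ['A white dog is laying on the ground with its head on its paws .',
              'Two beige dogs are playing in the grass near a doghouse .']
references = [['A white dog has its head on the ground .',
               'A white dog rests its head on the patio bricks .'],
              ['a black dog jumping to catch a rope toy .',
               'A large black dog is playing in a grassy yard .']]

# PAC-S: w * max(0, cos(image feature, caption feature)), averaged over the set
mean_pac, per_pac, cand_feats, cand_lens = PACScore(
    model, preprocess, images, candidates, device, w=2.0)

# RefPAC-S: per candidate, the maximum cosine similarity against its references
mean_refpac, per_refpac = RefPACScore(model, references, candidates, device, cand_lens)

print(mean_pac, mean_refpac)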
pacscore/evaluation/tokenizer.py
ADDED
@@ -0,0 +1,63 @@
+#!/usr/bin/env python
+#
+# File Name : ptbtokenizer.py
+#
+# Description : Do the PTB Tokenization and remove punctuations.
+#
+# Creation Date : 29-12-2014
+# Last Modified : Thu Mar 19 09:53:35 2015
+# Authors : Hao Fang <[email protected]> and Tsung-Yi Lin <[email protected]>
+
+import os
+import subprocess
+import tempfile
+
+class PTBTokenizer(object):
+    """Python wrapper of Stanford PTBTokenizer"""
+
+    corenlp_jar = 'stanford-corenlp-3.4.1.jar'
+    punctuations = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-", \
+                    ".", "?", "!", ",", ":", "-", "--", "...", ";"]
+
+    @classmethod
+    def tokenize(cls, corpus):
+        cmd = ['java', '-cp', cls.corenlp_jar, \
+               'edu.stanford.nlp.process.PTBTokenizer', \
+               '-preserveLines', '-lowerCase']
+
+        if isinstance(corpus, list) or isinstance(corpus, tuple):
+            if isinstance(corpus[0], list) or isinstance(corpus[0], tuple):
+                corpus = {i:c for i, c in enumerate(corpus)}
+            else:
+                corpus = {i: [c, ] for i, c in enumerate(corpus)}
+
+        # prepare data for PTB Tokenizer
+        tokenized_corpus = {}
+        image_id = [k for k, v in list(corpus.items()) for _ in range(len(v))]
+        sentences = '\n'.join([c.replace('\n', ' ') for k, v in corpus.items() for c in v])
+
+        # save sentences to temporary file
+        path_to_jar_dirname = os.path.dirname(os.path.abspath(__file__))
+        tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=path_to_jar_dirname)
+        tmp_file.write(sentences.encode())
+        tmp_file.close()
+
+        # tokenize sentence
+        cmd.append(os.path.basename(tmp_file.name))
+        p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, \
+                                       stdout=subprocess.PIPE, stderr=open(os.devnull, 'w'))
+        token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0]
+        token_lines = token_lines.decode()
+        lines = token_lines.split('\n')
+        # remove temp file
+        os.remove(tmp_file.name)
+
+        # create dictionary for tokenized captions
+        for k, line in zip(image_id, lines):
+            if not k in tokenized_corpus:
+                tokenized_corpus[k] = []
+            tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \
+                                          if w not in cls.punctuations])
+            tokenized_corpus[k].append(tokenized_caption)
+
+        return tokenized_corpus
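A short sketch (not part of the commit) of what `PTBTokenizer.tokenize` consumes and produces: it shells out to Java, so `stanford-corenlp-3.4.1.jar` must sit in the same directory as `tokenizer.py`. The example ids and captions below are illustrative, borrowed from the example files added later in this commit.

# Illustrative only: requires Java on PATH and stanford-corenlp-3.4.1.jar
# next to evaluation/tokenizer.py.
from evaluation import PTBTokenizer

corpus = {
    'image1': ['A white dog has its head on the ground .'],
    'image2': ['A black dog pounces to get a rope toy .'],
}
tokenized = PTBTokenizer.tokenize(corpus)
# e.g. {'image1': ['a white dog has its head on the ground'],
#       'image2': ['a black dog pounces to get a rope toy']}
# Plain lists/tuples are also accepted; they are re-keyed by index internally.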
pacscore/example/bad_captions.json
ADDED
@@ -0,0 +1,5 @@
+{
+    "image1": "A child is preparing to slide down a piece of playground equipment .",
+    "image2": "A black horse is jumping in the grass ."
+
+}
pacscore/example/good_captions.json
ADDED
@@ -0,0 +1,4 @@
+{
+    "image1": "A white dog is laying on the ground with its head on its paws .",
+    "image2": "'Two beige dogs are playing in the grass near a doghouse .'"
+}
pacscore/example/images/image1.jpg
ADDED
pacscore/example/images/image2.jpg
ADDED
pacscore/example/refs.json
ADDED
@@ -0,0 +1,18 @@
+{
+    "image1":
+    [
+        "A closeup of a white dog that is laying its head on its paws .",
+        "a large white dog lying on the floor .",
+        "A white dog has its head on the ground .",
+        "A white dog is resting its head on a tiled floor with its eyes open .",
+        "A white dog rests its head on the patio bricks ."
+    ],
+    "image2":
+    [
+        "a black dog jumping to catch a rope toy .",
+        "A black dog playing fetch with a ball of rope .",
+        "A black dog pounces to get a rope toy .",
+        "A black dog running after his rope toy .",
+        "A large black dog is playing in a grassy yard ."
+    ]
+}
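These example files line up by image id. A small sketch (not part of the commit) of how they can be loaded and aligned before scoring; the file paths assume the working directory is pacscore/, and the repository's actual driver script may differ.

# Illustrative only: aligns the example JSON files by image id.
import json

with open('example/good_captions.json') as f:
    candidates = json.load(f)   # {image_id: candidate caption}
with open('example/refs.json') as f:
    references = json.load(f)   # {image_id: [reference captions]}

ids = sorted(candidates.keys())
image_paths = [f'example/images/{i}.jpg' for i in ids]
cand_list = [candidates[i] for i in ids]
ref_list = [references[i] for i in ids]

# image_paths / cand_list / ref_list can then be passed to PACScore and
# RefPACScore as sketched after pac_score.py above.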
pacscore/images/model.png
ADDED
pacscore/models/__init__.py
ADDED
File without changes