clip-japanese-base

This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. This model was trained on ~1B web-collected image-text pairs, and it is applicable to various visual tasks including zero-shot image classification, text-to-image or image-to-text retrieval.

How to use

  1. Install packages
pip install pillow requests sentencepiece transformers torch timm
  1. Run
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]

Model architecture

The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.

Evaluation

Dataset

Result

Model Image Encoder Params Text Encoder params STAIR Captions (R@1) Recruit Datasets (acc@1) ImageNet-1K (acc@1)
Ours 86M(Eva02-B) 100M(BERT) 0.30 0.89 0.58
Stable-ja-clip 307M(ViT-L) 100M(BERT) 0.24 0.77 0.68
Rinna-ja-clip 86M(ViT-B) 100M(BERT) 0.13 0.54 0.56
Laion-clip 632M(ViT-H) 561M(XLM-RoBERTa) 0.30 0.83 0.58
Hakuhodo-ja-clip 632M(ViT-H) 100M(BERT) 0.21 0.82 0.46

Licenses

The Apache License, Version 2.0

Citation

@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author={Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama}
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
Downloads last month
14,580
Safetensors
Model size
197M params
Tensor type
F32
·
Inference Examples
Inference API (serverless) does not yet support model repos that contain custom code.