File size: 4,384 Bytes
a0fd4bb 071945c a0fd4bb 071945c a0fd4bb 071945c b7f8796 071945c b7f8796 071945c b7f8796 071945c ff0d639 071945c de9cae4 071945c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 |
---
language: ja
license: apache-2.0
tags:
- clip
- japanese-clip
pipeline_tag: feature-extraction
---
# clip-japanese-base
This is a Japanese [CLIP (Contrastive Language-Image Pre-training)](https://arxiv.org/abs/2103.00020) model developed by [LY Corporation](https://www.lycorp.co.jp/en/). This model was trained on ~1B web-collected image-text pairs, and it is applicable to various visual tasks including zero-shot image classification, text-to-image or image-to-text retrieval.
## How to use
1. Install packages
```
pip install pillow requests sentencepiece transformers torch timm
```
2. Run
```python
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)
with torch.no_grad():
image_features = model.get_image_features(**image)
text_features = model.get_text_features(**text)
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
# [[1., 0., 0.]]
```
## Model architecture
The model uses an [Eva02-B](https://huggingface.co/timm/eva02_base_patch16_clip_224.merged2b_s8b_b131k) Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16).
## Evaluation
### Dataset
- [STAIR Captions](http://captions.stair.center/) (v2014 val set of MSCOCO) for image-to-text (i2t) and text-to-image (t2i) retrieval. We measure performance using R@1, which is the average recall of i2t and t2i retrieval.
- [Recruit Datasets](https://huggingface.co/datasets/recruit-jp/japanese-image-classification-evaluation-dataset) for image classification.
- [ImageNet-1K](https://www.image-net.org/download.php) for image classification. We translated all classnames into Japanese. The classnames and templates can be found in [ja-imagenet-1k-classnames.txt](https://huggingface.co/line-corporation/clip-japanese-base/blob/main/ja-imagenet-1k-classnames.txt) and [ja-imagenet-1k-templates.txt](https://huggingface.co/line-corporation/clip-japanese-base/blob/main/ja-imagenet-1k-templates.txt).
### Result
| Model | Image Encoder Params | Text Encoder params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
|-------------------|----------------------|---------------------|----------------|------------------|-----------------|
| [Ours](https://huggingface.co/line-corporation/clip-japanese-base) | 86M(Eva02-B) | 100M(BERT) | **0.30** | **0.89** | 0.58 |
| [Stable-ja-clip](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 307M(ViT-L) | 100M(BERT) | 0.24 | 0.77 | **0.68** |
| [Rinna-ja-clip](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 86M(ViT-B) | 100M(BERT) | 0.13 | 0.54 | 0.56 |
| [Laion-clip](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 632M(ViT-H) | 561M(XLM-RoBERTa) | **0.30** | 0.83 | 0.58 |
| [Hakuhodo-ja-clip](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-wider) | 632M(ViT-H) | 100M(BERT) | 0.21 | 0.82 | 0.46 |
## Licenses
[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Citation
```
@misc{clip-japanese-base,
title = {CLIP Japanese Base},
author={Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama}
url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
``` |