Update README.md
README.md CHANGED
---
pipeline_tag: zero-shot-image-classification
license:
- apache-2.0
datasets:
- llm-jp/relaion2B-en-research-safe-japanese-translation
language:
- ja
---
# Model Details

Japanese CLIP model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation), a Japanese translation of the English subset of [ReLAION-5B](https://huggingface.co/datasets/laion/relaion2B-en-research-safe). The translation was performed with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).

The total number of parameters of this model is 467M.
# Usage

Install OpenCLIP with `pip install open_clip_torch`, then load the model and tokenizer from the Hugging Face Hub:

```python
import open_clip

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')

import torch
from PIL import Image
```
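The diff shows only the opening lines of the snippet. As a self-contained sketch of how a typical open_clip zero-shot prediction continues from here (the image path `image.png` and the candidate captions are placeholders, not part of the original card):

```python
import torch
from PIL import Image
import open_clip

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')

# Placeholder inputs: any local image and a few candidate Japanese captions.
image = preprocess(Image.open("image.png")).unsqueeze(0)
text = tokenizer(["犬", "猫", "鳥"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each caption.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

The caption with the highest probability is the model's zero-shot prediction for the image.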
## Training Data

This model is trained on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation). Because image downloads succeeded at a rate of about 70%, the resulting dataset contains 1.45 billion samples, which we processed over 9 epochs (about 13 billion samples seen in total).
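The arithmetic behind these figures, using only the numbers stated in the paragraph above:

```python
# Consistency check of the reported sizes (derived from the paragraph above).
dataset_size = 1.45e9                # samples whose images downloaded successfully
implied_pool = dataset_size / 0.70   # ~2.07e9 URL-caption pairs before the ~70% download success
samples_seen = 9 * dataset_size      # ~1.31e10, i.e. the "13 billion samples in total"
print(f"pool ~ {implied_pool:.2e}, seen ~ {samples_seen:.2e}")
```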
# Evaluation

Evaluation Code: https://github.com/llm-jp/clip-eval

**Table:** Performance of each model in zero-shot image classification and image-text retrieval tasks. **Bold** indicates first place, and _underline_ indicates second place.
| Model | Params (M) | ImageNet | Recruit | CIFAR10 | CIFAR100 | Food101 | Caltech101 | XM3600 I → T | XM3600 T → I | Avg. |
|-----------------------------|-------------|----------|---------|---------|----------|---------|------------|-------------|-------------|------|
| **Japanese CLIP** | | | | | | | | | | |
| [Rinna ViT-B/16](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 196 | 50.6 | 39.9 | 90.7 | 64.0 | 53.2 | 84.6 | 53.8 | 54.0 | 61.4 |
| [Rinna ViT-B/16 cloob](https://huggingface.co/rinna/japanese-cloob-vit-b-16) | 196 | 54.6 | 41.6 | 88.2 | 60.3 | 57.2 | 80.2 | 53.4 | 53.4 | 61.1 |
| [LY ViT-B/16](https://huggingface.co/line-corporation/clip-japanese-base) | 196 | 52.0 | **83.8** | 96.3 | 76.7 | 73.9 | **88.4** | **76.9** | **78.0** | **78.3** |
| [**llm-jp-ViT-B/16**](https://huggingface.co/llm-jp/llm-jp-clip-vit-base-patch16) | 248 | 54.2 | 59.4 | 91.8 | 69.2 | _82.2_ | 85.6 | 73.6 | 72.7 | 73.6 |
| [StabilityAI ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 414 | **62.4** | 70.5 | _97.6_ | **84.1** | 74.0 | 86.7 | 67.3 | 66.0 | 76.1 |
| [**llm-jp-ViT-L/14**](https://huggingface.co/llm-jp/llm-jp-clip-vit-large-patch14) | 467 | _59.5_ | 62.9 | 96.4 | 77.0 | **88.2** | _87.8_ | 74.1 | _74.1_ | _77.5_ |
| **Multilingual CLIP** | | | | | | | | | | |
| [SigLIP B/16-256 multi](https://huggingface.co/google/siglip-base-patch16-256-multilingual) | 370 | 51.9 | 71.2 | 92.4 | 65.8 | 78.6 | 85.6 | 45.9 | 43.0 | 66.8 |
| [jina-clip-v2](https://huggingface.co/jinaai/jina-clip-v2) | 865 | 35.8 | 48.1 | 95.1 | 58.3 | 52.0 | 69.4 | 67.3 | 66.4 | 61.6 |
| [LAION ViT-H/14 multi](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 1193 | 53.0 | _74.5_ | **97.9** | _78.4_ | 74.3 | 85.1 | _75.0_ | 72.0 | 76.3 |
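For context on how scores like these are computed, here is a minimal sketch of CLIP-style zero-shot classification accuracy. It is an assumed setup for illustration, not the llm-jp/clip-eval implementation; `dataset` and `class_names` are placeholders:

```python
import torch

def zero_shot_accuracy(model, preprocess, tokenizer, dataset, class_names):
    """Sketch of CLIP-style zero-shot classification (not llm-jp/clip-eval itself).

    `dataset` is any iterable of (PIL image, class index) pairs and
    `class_names` the candidate labels, e.g. Japanese class names.
    """
    with torch.no_grad():
        # Embed every class name once; real evaluations usually add prompt templates.
        text_features = model.encode_text(tokenizer(class_names))
        text_features /= text_features.norm(dim=-1, keepdim=True)

        correct = 0
        total = 0
        for image, label in dataset:
            image_features = model.encode_image(preprocess(image).unsqueeze(0))
            image_features /= image_features.norm(dim=-1, keepdim=True)
            # Predict the class whose text embedding is most similar to the image.
            pred = (image_features @ text_features.T).argmax(dim=-1).item()
            correct += int(pred == label)
            total += 1
    return correct / total
```

The retrieval columns (XM3600 I → T and T → I) are typically computed analogously, ranking captions against each image (and vice versa) by the same similarity score.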
# LICENSE

[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

Please refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), as the training data was translated with gemma-2-9b-it. We use Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e) of those terms, our model is not a model trained "in order to cause that model to perform similarly to Gemma," and we have therefore concluded that it does not need to inherit the Gemma license.
# Citation

Bibtex:

```
@inproceedings{sugiura2025clip,
  author = {杉浦 一瑳 and 栗田 修平 and 小田 悠介 and 河原 大輔 and 岡崎 直観},
  month = mar,
  series = {言語処理学会第31回年次大会 (NLP2025)},
  title = {オープンLLMによる翻訳を活用した日本語CLIPの開発},
  year = {2025}
}
```

(In English: Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, and Naoaki Okazaki, "Development of Japanese CLIP Using Translation by Open LLMs," the 31st Annual Meeting of the Association for Natural Language Processing (NLP2025), March 2025.)
|