---
pipeline_tag: sentence-similarity
language: ja
license: cc-by-sa-4.0
tags:
- transformers
- sentence-transformers
- feature-extraction
- sentence-similarity
---

# Japanese SimCSE (BERT-base)
[日本語のREADME/Japanese README](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp/blob/main/README_JA.md)

## Summary
Model name: `pkshatech/simcse-ja-bert-base-clcmlp`

This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model that extracts sentence embeddings from Japanese sentences. It is based on [`cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) and trained on the [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) dataset, a Japanese natural language inference dataset.

## Usage (Sentence-Transformers)
You can use this model easily with [sentence-transformers](https://www.SBERT.net).

You need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization.

Please install sentence-transformers, fugashi, and unidic-lite with pip as follows:
```
pip install -U "fugashi[unidic-lite]" sentence-transformers
```

You can load the model and convert sentences to dense vectors as follows:

```python
from sentence_transformers import SentenceTransformer

# Example Japanese sentences to embed.
sentences = [
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
    "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。",
]

# Load the model by its Hugging Face Hub name.
model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')
# Encode the sentences; the result holds one dense vector per input sentence.
embeddings = model.encode(sentences)
print(embeddings)
```

Since the contrastive loss used during training is based on cosine similarity, we recommend using cosine similarity to compare embeddings in downstream tasks.

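To make the recommendation above concrete, here is a minimal sketch of pairwise cosine similarity in plain Python. The vectors in `toy_embeddings` are made-up stand-ins for real sentence embeddings, and the helper names `cosine_similarity` and `similarity_matrix` are hypothetical, chosen only for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional vectors (assumption: real embeddings come from model.encode).
toy_embeddings = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],  # same direction as the first vector
    [0.0, 0.0, 3.0],  # orthogonal to the first two vectors
]

sims = similarity_matrix(toy_embeddings)
for row in sims:
    print([round(value, 3) for value in row])
```

With real embeddings, each row returned by `model.encode(sentences)` can be passed in place of the toy vectors. The sentence-transformers library also provides `sentence_transformers.util.cos_sim`, which computes the same quantity on arrays or tensors.
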
## Model Details

### Tokenization
We use the same tokenizer as `cl-tohoku/bert-base-japanese-v2`. Please see the [README of `cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) for details.

### Training
We initialized the model from `cl-tohoku/bert-base-japanese-v2` and trained it on the train set of [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88). We trained for 20 epochs and published the checkpoint with the highest Spearman's correlation coefficient on the validation set [^1], taken from the train set of [JSTS](https://github.com/yahoojapan/JGLUE).

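The checkpoint selection described above can be sketched as follows. This is only an illustration under stated assumptions: the gold scores, the checkpoint names, and the predicted similarities are invented, and the helpers `average_ranks`, `pearson`, and `spearman` are hypothetical names rather than the code used to train this model.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human similarity scores for five validation sentence pairs.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
# Hypothetical cosine similarities predicted by two saved checkpoints.
predictions = {
    "step-250": [0.30, 0.10, 0.50, 0.40, 0.90],
    "step-500": [0.05, 0.20, 0.45, 0.70, 0.95],
}

scores = {name: spearman(gold, pred) for name, pred in predictions.items()}
best_checkpoint = max(scores, key=scores.get)
print(scores, best_checkpoint)
```

Spearman's coefficient compares rankings only, so a checkpoint is rewarded for ordering sentence pairs the way the human scores do, regardless of the absolute scale of its cosine similarities.
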
### Training Parameters

| Parameter | Value |
| --- | --- |
| pooling_strategy | [CLS] -> single fully-connected layer |
| max_seq_length | 128 |
| with hard negative | true |
| temperature of contrastive loss | 0.05 |
| Batch size | 200 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Warmup steps | 2012 |
| Scheduler | WarmupLinear |
| Epochs | 20 |
| Evaluation steps | 250 |

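To illustrate how the temperature and hard-negative settings in the table interact, here is a minimal sketch of a contrastive loss in the style of the supervised objective from the SimCSE paper. It assumes that objective; the actual training code is not part of this README, and the function name `contrastive_loss` and the toy 2-dimensional vectors are hypothetical. The inputs stand for sentence embeddings after pooling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Mean cross-entropy of picking each anchor's own positive.

    Every anchor is scored against all positives and all hard negatives
    in the batch; the positive at the same index is the correct choice.
    """
    candidates = positives + hard_negatives
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two examples (assumption: real batches hold 200 examples).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_matched = contrastive_loss(anchors, positives, hard_negatives)
# Swapping the positives pairs each anchor with the wrong sentence.
loss_swapped = contrastive_loss(anchors, positives[::-1], hard_negatives)
print(loss_matched, loss_swapped)
```

Dividing by a small temperature such as 0.05 stretches the cosine similarities, which lie in [-1, 1], into logits in [-20, 20], so the softmax separates close candidates more sharply. With a batch size of 200, each anchor in this formulation would be compared against 200 positives and 200 hard negatives.
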
## Licenses
This model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.

[^1]: When we trained this model, the test data of JGLUE had not been released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint on the train set of JGLUE instead of its dev set.