pipeline_tag: sentence-similarity
language: ja
license: cc-by-sa-4.0
tags:
  - transformers
  - sentence-transformers
  - feature-extraction
  - sentence-similarity

Japanese SimCSE (BERT-base)

Japanese README (日本語のREADME)

Summary

model name: pkshatech/simcse-ja-bert-base-clcmlp

This is a Japanese SimCSE model. You can use it to extract sentence embeddings from Japanese sentences. The model is based on cl-tohoku/bert-base-japanese-v2 and was trained on the JSNLI dataset, a Japanese natural language inference dataset.

Usage (Sentence-Transformers)

You can use this model easily with sentence-transformers.

You need fugashi and unidic-lite for tokenization.

Please install sentence-transformers, fugashi, and unidic-lite with pip as follows:

pip install -U "fugashi[unidic-lite]" sentence-transformers

You can load the model and convert sentences to dense vectors as follows:

from sentence_transformers import SentenceTransformer
sentences = [
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
    "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。",
]

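# Load the model from the Hugging Face Hub.
model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')
# Encode the sentences; the result holds one dense vector per sentence.
embeddings = model.encode(sentences)
print(embeddings)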

Since the contrastive loss used during training is based on cosine similarity, we recommend using cosine similarity for downstream tasks.
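
To make the recommendation concrete, here is a minimal sketch of pairwise cosine similarity in plain Python. The helper names `cosine_similarity` and `similarity_matrix` and the toy vectors are illustrative assumptions, not part of this model or of sentence-transformers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities for a sequence of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional vectors standing in for real sentence embeddings
# (hypothetical values, chosen only for illustration).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
]
scores = similarity_matrix(toy_embeddings)
print(scores[0][1])  # vectors 0 and 1 point in the same direction
print(scores[0][2])  # vectors 0 and 2 are orthogonal
```

With the real model, `similarity_matrix(embeddings)` should accept the array returned by `model.encode(sentences)`. In practice you would more likely use a vectorized routine such as `sentence_transformers.util.cos_sim`.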

Model Details

Tokenization

We use the same tokenizer as cl-tohoku/bert-base-japanese-v2. Please see the README of cl-tohoku/bert-base-japanese-v2 for details.

Training

We initialized the model from cl-tohoku/bert-base-japanese-v2 and trained it on the train set of JSNLI. We trained for 20 epochs and published the checkpoint with the highest Spearman's correlation coefficient on the validation set [^1], which was taken from the train set of JSTS.
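
The checkpoint selection described above can be sketched as follows. This is an illustration only, not our training code: the checkpoint names, predicted similarities, and gold scores are hypothetical, and Spearman's correlation is computed as the Pearson correlation of average ranks.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold similarity scores for five validation sentence pairs.
gold_scores = [0.0, 1.5, 3.0, 4.2, 5.0]

# Hypothetical cosine similarities predicted by two saved checkpoints.
checkpoint_predictions = {
    "step-250": [0.10, 0.40, 0.35, 0.80, 0.90],
    "step-500": [0.05, 0.30, 0.55, 0.70, 0.95],
}

results = {
    name: spearman(predicted, gold_scores)
    for name, predicted in checkpoint_predictions.items()
}
# Keep the checkpoint whose predictions rank the pairs most like the gold scores.
best_checkpoint = max(results, key=results.get)
print(best_checkpoint, results[best_checkpoint])
```

Spearman's correlation depends only on the ordering of the scores, so cosine similarities do not need to be rescaled to the range of the gold scores. For real evaluations, a library routine such as `scipy.stats.spearmanr` is the more usual choice.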

Training Parameters

| Parameter | Value |
| --- | --- |
| pooling_strategy | [CLS] -> single fully-connected layer |
| max_seq_length | 128 |
| with hard negative | true |
| temperature of contrastive loss | 0.05 |
| Batch size | 200 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Warmup steps | 2012 |
| Scheduler | WarmupLinear |
| Epochs | 20 |
| Evaluation steps | 250 |
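
To show how the temperature and hard-negative settings enter the objective, here is a rough sketch of a contrastive loss in plain Python. It assumes the standard supervised SimCSE formulation (cross-entropy over temperature-scaled cosine similarities, with in-batch positives and hard negatives as candidates). This card does not spell out the exact loss, so treat the function `contrastive_loss` and the toy vectors as hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Mean contrastive loss over a batch (assumed SimCSE-style objective).

    anchors[i] should be pulled toward positives[i] and pushed away from
    every other positive and every hard negative in the batch.
    """
    # Candidates: all positives first, then all hard negatives.
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-softmax probability of the true positive (index i).
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-dimensional embeddings (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.5, 0.5], [0.5, 0.5]]

loss = contrastive_loss(anchors, positives, hard_negatives, temperature=0.05)
print(loss)
```

A small temperature such as 0.05 sharpens the softmax, so even modest differences in cosine similarity between the positive and the negatives dominate the loss.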

Licenses

This model is distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license.

[^1]: When we trained this model, the test data of JGLUE had not been released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint on the train set of JGLUE instead of its dev set.