init
- 1_Pooling/config.json +7 -0
- 2_Dense/config.json +1 -0
- 2_Dense/pytorch_model.bin +3 -0
- README.md +74 -0
- README_JA.md +73 -0
- config.json +26 -0
- config_sentence_transformers.json +7 -0
- modules.json +20 -0
- pytorch_model.bin +3 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +7 -0
- tokenizer_config.json +22 -0
- vocab.txt +0 -0
1_Pooling/config.json
ADDED
@@ -0,0 +1,7 @@
{
    "word_embedding_dimension": 768,
    "pooling_mode_cls_token": true,
    "pooling_mode_mean_tokens": false,
    "pooling_mode_max_tokens": false,
    "pooling_mode_mean_sqrt_len_tokens": false
}
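The settings above select CLS-token pooling: the sentence representation is simply the hidden state of the first ([CLS]) token. A minimal illustrative sketch of that operation on toy tensors (not code from this repository):

```python
import torch

# Toy batch of contextual token embeddings: (batch, seq_len, hidden=768).
token_embeddings = torch.randn(2, 16, 768)

# CLS-token pooling: take the first token's vector as the sentence embedding.
sentence_embeddings = token_embeddings[:, 0]
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```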
2_Dense/config.json
ADDED
@@ -0,0 +1 @@
{"in_features": 768, "out_features": 768, "bias": true, "activation_function": "torch.nn.modules.activation.Tanh"}
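This Dense head applies a 768 -> 768 fully connected layer with a Tanh activation on top of the pooled [CLS] vector. A rough PyTorch equivalent is sketched below; it is randomly initialized and only shows the shape of the transformation, since the trained weights live in 2_Dense/pytorch_model.bin:

```python
import torch
from torch import nn

# Linear(768 -> 768, bias=True) followed by Tanh, mirroring 2_Dense/config.json.
dense_head = nn.Sequential(nn.Linear(768, 768, bias=True), nn.Tanh())

pooled = torch.randn(2, 768)     # pooled [CLS] vectors
print(dense_head(pooled).shape)  # torch.Size([2, 768])
```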
2_Dense/pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ea3c2537569e3a6ad6c60be3ff84001db62f043945e0b1db4e6e052aad3e4e70
size 2363431
README.md
CHANGED
@@ -1,3 +1,77 @@
---
pipeline_tag: sentence-similarity
language: ja
license: cc-by-sa-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity

---

# Japanese SimCSE (BERT-base)

[日本語のREADME/Japanese README](https://huggingface.co/pkshatech/simcse-ja-bert-base-clcmlp/blob/main/README_JA.md)

## Summary

model name: `pkshatech/simcse-ja-bert-base-clcmlp`

This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embeddings from Japanese sentences with it. The model is based on `cl-tohoku/bert-base-japanese-v2` and trained on [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), a Japanese natural language inference dataset.

## Usage (Sentence-Transformers)

Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed. You also need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization.

Install sentence-transformers, fugashi, and unidic-lite with pip as follows:

```
pip install -U fugashi[unidic-lite] sentence-transformers
```

You can then load the model and convert sentences to dense vectors:

```python
from sentence_transformers import SentenceTransformer

sentences = [
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
    "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。"
]

model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')
embeddings = model.encode(sentences)
print(embeddings)
```

Since the loss function used during training is based on cosine similarity, we recommend using cosine similarity for downstream tasks.
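For example, cosine similarities between embeddings can be computed with `sentence_transformers.util.cos_sim`. The short snippet below is illustrative and not part of the original model card; it assumes a sentence-transformers version that provides `util.cos_sim` (e.g. 2.x):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')
emb = model.encode([
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
])

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(emb[0], emb[1]))
```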
## Model Detail

### Tokenization

We use the same tokenizer as `cl-tohoku/bert-base-japanese-v2`. Please see the [README of `cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) for details.

### Training

We initialized the model with `cl-tohoku/bert-base-japanese-v2` and trained it on the train set of [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88). We trained for 20 epochs and published the checkpoint with the highest Spearman's rank correlation coefficient on our validation set[^1], which was drawn from the train set of [JSTS](https://github.com/yahoojapan/JGLUE).

### Training Parameters

| Parameter | Value |
| --- | --- |
| pooling_strategy | [CLS] -> single fully-connected layer |
| max_seq_length | 128 |
| with hard negative | true |
| temperature of contrastive loss | 0.05 |
| Batch size | 200 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Warmup steps | 2012 |
| Scheduler | WarmupLinear |
| Epochs | 20 |
| Evaluation steps | 250 |

[^1]: When we trained this model, the test data of JGLUE had not been released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint on the train set of JGLUE instead of its dev set.
README_JA.md
ADDED
@@ -0,0 +1,73 @@
---
pipeline_tag: sentence-similarity
language: ja
license: cc-by-sa-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity

---

# Japanese SimCSE (BERT-base)

This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. It is based on `cl-tohoku/bert-base-japanese-v2` and was trained on [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), a Japanese natural language inference dataset.

## Usage (Sentence-Transformers)

You can use this model easily with [sentence-transformers](https://www.SBERT.net). You also need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization.

Install sentence-transformers, fugashi, and unidic-lite with pip as follows:

```
pip install -U fugashi[unidic-lite]
pip install -U sentence-transformers
```

You can then load the model and convert sentences to dense vectors:

```python
from sentence_transformers import SentenceTransformer

sentences = [
    "PKSHA Technologyは機械学習/深層学習技術に関わるアルゴリズムソリューションを展開している。",
    "この深層学習モデルはPKSHA Technologyによって学習され、公開された。",
    "広目天は、仏教における四天王の一尊であり、サンスクリット語の「種々の眼をした者」を名前の由来とする。"
]

model = SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')
embeddings = model.encode(sentences)
print(embeddings)
```

Since cosine similarity is used in the loss function during training, we recommend using cosine similarity for similarity computation in downstream tasks.

# Model Detail

## Tokenization

We use the same tokenizer as `cl-tohoku/bert-base-japanese-v2`. Please see the [model page of `cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) for details.

## Training

We initialized the model with `cl-tohoku/bert-base-japanese-v2` and trained it on the train set of [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), a Japanese natural language inference dataset. We trained for 20 epochs and published the checkpoint with the highest Spearman's rank correlation coefficient on our validation set[^1], which is a portion of the train set of [JSTS](https://github.com/yahoojapan/JGLUE).

## Training Parameters

| Parameter | Value |
| --- | --- |
| pooling_strategy | [CLS] -> single fully-connected layer |
| max_seq_length | 128 |
| with hard negative | true |
| temperature of contrastive loss | 0.05 |
| Batch size | 200 |
| Learning rate | 1e-5 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Warmup steps | 2012 |
| Scheduler | WarmupLinear |
| Epochs | 20 |
| Evaluation steps | 250 |

[^1]: When we trained this model, the test data of JGLUE had not been released, so we used the dev set of JGLUE as private evaluation data. As a knock-on effect, we selected the checkpoint using the train set of JGLUE instead of its dev set.
config.json
ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "cl-tohoku/bert-base-japanese-v2",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "tokenizer_class": "BertJapaneseTokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32768
}
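This is the configuration of the underlying BERT backbone (`BertModel`). As a hedged sketch, the backbone can also be loaded directly with Hugging Face Transformers; note that this yields only the raw encoder output and its [CLS] vector, not the final sentence embedding, because the Dense head stored in 2_Dense is not applied:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "pkshatech/simcse-ja-bert-base-clcmlp"

# Requires fugashi and unidic-lite, since the tokenizer is a BertJapaneseTokenizer.
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("これはテスト文です。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] pooling, as specified in 1_Pooling/config.json.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # torch.Size([1, 768])
```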
config_sentence_transformers.json
ADDED
@@ -0,0 +1,7 @@
{
  "__version__": {
    "sentence_transformers": "2.2.2",
    "transformers": "4.25.1",
    "pytorch": "1.12.0+cu113"
  }
}
modules.json
ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Dense",
    "type": "sentence_transformers.models.Dense"
  }
]
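modules.json declares the module chain Transformer -> Pooling -> Dense. For illustration only, an equivalent architecture could be assembled by hand with sentence-transformers using the parameters from the config files in this commit; this reproduces the structure but not the fine-tuned weights, so loading `SentenceTransformer('pkshatech/simcse-ja-bert-base-clcmlp')` remains the intended way to use the model:

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

# 0: BERT backbone (max_seq_length from sentence_bert_config.json).
transformer = models.Transformer("cl-tohoku/bert-base-japanese-v2", max_seq_length=128)

# 1: CLS-token pooling (1_Pooling/config.json).
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)

# 2: 768 -> 768 Dense projection with Tanh (2_Dense/config.json).
dense = models.Dense(in_features=768, out_features=768, activation_function=nn.Tanh())

# Chain the modules in the order given by modules.json.
model = SentenceTransformer(modules=[transformer, pooling, dense])
```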
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ae4edc54026eb831ecddff1013e637db22cf7977cfd731a977b840261e421dac
size 444898097
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 128,
  "do_lower_case": false
}
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer_config.json
ADDED
@@ -0,0 +1,22 @@
{
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "do_subword_tokenize": true,
  "do_word_tokenize": true,
  "jumanpp_kwargs": null,
  "mask_token": "[MASK]",
  "mecab_kwargs": {
    "mecab_dic": "unidic_lite"
  },
  "model_max_length": 1000000000000000019884624838656,
  "name_or_path": "cl-tohoku/bert-base-japanese-v2",
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "subword_tokenizer_type": "wordpiece",
  "sudachi_kwargs": null,
  "tokenizer_class": "BertJapaneseTokenizer",
  "unk_token": "[UNK]",
  "word_tokenizer_type": "mecab"
}
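The tokenizer is a `BertJapaneseTokenizer` that segments text with MeCab (unidic-lite dictionary) and then applies WordPiece, which is why fugashi and unidic-lite are required. A small illustrative check on a hypothetical example sentence:

```python
from transformers import AutoTokenizer

# Loads BertJapaneseTokenizer; fugashi and unidic-lite must be installed.
tokenizer = AutoTokenizer.from_pretrained("pkshatech/simcse-ja-bert-base-clcmlp")
print(tokenizer.tokenize("日本語の文埋め込みを計算する。"))
```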
vocab.txt
ADDED
The diff for this file is too large to render.