small update of readmes

- README.md +5 -2
- README_JA.md +2 -0
- tokenizer_config.json +4 -18
README.md
CHANGED
@@ -16,11 +16,11 @@ tags:
 model name: `pkshatech/simcse-ja-bert-base-clcmlp`
 
 
-This is Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embedding representations from Japanese sentences. This model is based on `cl-tohoku/bert-base-japanese-v2` and trained on [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), which is Japanese natural language inference dataset.
+This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embedding representations from Japanese sentences. This model is based on [`cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) and trained on the [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) dataset, a Japanese natural language inference dataset.
 
 
 ## Usage (Sentence-Transformers)
-
+You can use this model easily with [sentence-transformers](https://www.SBERT.net).
 
 You need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization.
 
@@ -72,6 +72,9 @@ We set `tohoku/bert-base-japanese-v2` as the initial value and trained it on the
 | Evaluation steps | 250 |
 
 
+# Licenses
+This model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
+
 
 [^1]: When we trained this model, the test data of JGLUE was not released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint on the train set of JGLUE instead of its dev set.
 
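The README's own usage snippet sits outside the hunks shown above (the README_JA.md hunk header below carries `print(embeddings)` as diff context). For orientation, here is a minimal sketch of the flow this section describes, assuming the standard sentence-transformers API; the example sentences are illustrative, not taken from the README:

```python
# Sketch only: assumes the standard sentence-transformers API; the
# example sentences below are illustrative, not from the README.
# Setup: pip install sentence-transformers fugashi unidic-lite
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")

sentences = [
    "今日は天気が良いです。",  # "The weather is nice today."
    "今日は晴れています。",    # "It is sunny today."
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768) for a BERT-base encoder

# Compare the two sentence embeddings by cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))
```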
README_JA.md
CHANGED
@@ -69,5 +69,7 @@ print(embeddings)
 | Evaluation steps | 250 |
 
 
+# License
+This model is licensed under the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
 
 [^1]: When this model was trained, the JGLUE test data had not yet been released, so we used the JGLUE dev set as private evaluation data. As a knock-on effect, we selected the checkpoint on the JGLUE train set rather than its dev set.
tokenizer_config.json
CHANGED
@@ -1,22 +1,8 @@
 {
-  "cls_token": "[CLS]",
   "do_lower_case": false,
-  "do_subword_tokenize": true,
-  "do_word_tokenize": true,
-  "jumanpp_kwargs": null,
-  "mask_token": "[MASK]",
-  "mecab_kwargs": {
-    "mecab_dic": "unidic_lite"
-  },
-  "model_max_length": 1000000000000000019884624838656,
-  "name_or_path": "cl-tohoku/bert-base-japanese-v2",
-  "never_split": null,
-  "pad_token": "[PAD]",
-  "sep_token": "[SEP]",
-  "special_tokens_map_file": null,
+  "word_tokenizer_type": "mecab",
   "subword_tokenizer_type": "wordpiece",
-  "sudachi_kwargs": null,
-  "tokenizer_class": "BertJapaneseTokenizer",
-  "unk_token": "[UNK]",
-  "word_tokenizer_type": "mecab"
+  "mecab_kwargs": {
+    "mecab_dic": "unidic_lite"
+  }
 }
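The trimmed config keeps only what `BertJapaneseTokenizer` needs for this model: MeCab word segmentation with the `unidic_lite` dictionary, followed by WordPiece subword splitting. A minimal sketch of loading and exercising the tokenizer, assuming fugashi and unidic-lite are installed; the sample sentence is illustrative:

```python
# Sketch only: loads the tokenizer configured above and shows its
# two-stage pipeline. Assumes fugashi and unidic-lite are installed;
# the sample sentence is illustrative.
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    "pkshatech/simcse-ja-bert-base-clcmlp"
)

# MeCab (unidic_lite) segments the text into words, then WordPiece
# splits each word into subword units from the model's vocabulary.
print(tokenizer.tokenize("日本語の文埋め込みを抽出します。"))
```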