small update of readmes

- README.md +5 -2
- README_JA.md +2 -0
- tokenizer_config.json +4 -18
README.md
CHANGED
@@ -16,11 +16,11 @@ tags:
 model name: `pkshatech/simcse-ja-bert-base-clcmlp`
 
 
-This is Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embedding representations from Japanese sentences. This model is based on `cl-tohoku/bert-base-japanese-v2` and trained on [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), which is Japanese natural language inference dataset.
+This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embedding representations from Japanese sentences. This model is based on [`cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) and trained on the [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) dataset, a Japanese natural language inference dataset.
 
 
 ## Usage (Sentence-Transformers)
-
+You can use this model easily with [sentence-transformers](https://www.SBERT.net).
 
 You need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization.
 
@@ -72,6 +72,9 @@ We set `tohoku/bert-base-japanese-v2` as the initial value and trained it on the
 | Evaluation steps | 250 |
 
 
+# Licenses
+This model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
+
 
 [^1]: When we trained this model, the test data of JGLUE was not released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint on the train set of JGLUE instead of its dev set.
 
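The README's own usage snippet sits outside the hunks shown above (the README_JA.md hunk header below carries `print(embeddings)` as diff context). For orientation, here is a minimal sketch of the flow this section describes, assuming the standard sentence-transformers API; the example sentences are illustrative, not taken from the README:

```python
# Sketch only: assumes the standard sentence-transformers API; the
# example sentences below are illustrative, not from the README.
# Setup: pip install sentence-transformers fugashi unidic-lite
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")

sentences = [
    "今日は天気が良いです。",  # "The weather is nice today."
    "今日は晴れています。",    # "It is sunny today."
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768) for a BERT-base encoder

# Compare the two sentence embeddings by cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))
```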
README_JA.md
CHANGED
@@ -69,5 +69,7 @@ print(embeddings)
 | Evaluation steps | 250 |
 
 
+# License
+This model is licensed under the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
 
 [^1]: When this model was trained, the JGLUE test data had not yet been released, so we used the JGLUE dev set as private evaluation data. As a knock-on effect, we selected the checkpoint on the JGLUE train set rather than its dev set.
tokenizer_config.json
CHANGED
@@ -1,22 +1,8 @@
 {
-  "cls_token": "[CLS]",
   "do_lower_case": false,
-  "do_subword_tokenize": true,
-  "do_word_tokenize": true,
-  "jumanpp_kwargs": null,
-  "mask_token": "[MASK]",
-  "mecab_kwargs": {
-    "mecab_dic": "unidic_lite"
-  },
-  "model_max_length": 1000000000000000019884624838656,
-  "name_or_path": "cl-tohoku/bert-base-japanese-v2",
-  "never_split": null,
-  "pad_token": "[PAD]",
-  "sep_token": "[SEP]",
-  "special_tokens_map_file": null,
+  "word_tokenizer_type": "mecab",
   "subword_tokenizer_type": "wordpiece",
-  "sudachi_kwargs": null,
-  "tokenizer_class": "BertJapaneseTokenizer",
-  "unk_token": "[UNK]",
-  "word_tokenizer_type": "mecab"
+  "mecab_kwargs": {
+    "mecab_dic": "unidic_lite"
+  }
 }
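The trimmed config keeps only what `BertJapaneseTokenizer` needs for this model: MeCab word segmentation with the `unidic_lite` dictionary, followed by WordPiece subword splitting. A minimal sketch of loading and exercising the tokenizer, assuming fugashi and unidic-lite are installed; the sample sentence is illustrative:

```python
# Sketch only: loads the tokenizer configured above and shows its
# two-stage pipeline. Assumes fugashi and unidic-lite are installed;
# the sample sentence is illustrative.
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    "pkshatech/simcse-ja-bert-base-clcmlp"
)

# MeCab (unidic_lite) segments the text into words, then WordPiece
# splits each word into subword units from the model's vocabulary.
print(tokenizer.tokenize("日本語の文埋め込みを抽出します。"))
```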