Commit a6dc7c2 by akiFQC · 1 Parent(s): 18b35f1

small update of readmes

Files changed (3)
  1. README.md +5 -2
  2. README_JA.md +2 -0
  3. tokenizer_config.json +4 -18
README.md CHANGED
@@ -16,11 +16,11 @@ tags:
  model name: `pkshatech/simcse-ja-bert-base-clcmlp`
 
 
- This is Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embedding representations from Japanese sentences. This model is based on `cl-tohoku/bert-base-japanese-v2` and trained on [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), which is Japanese natural language inference dataset.
+ This is a Japanese [SimCSE](https://arxiv.org/abs/2104.08821) model. You can easily extract sentence embedding representations from Japanese sentences. This model is based on [`cl-tohoku/bert-base-japanese-v2`](https://huggingface.co/cl-tohoku/bert-base-japanese-v2) and trained on the [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) dataset, which is a Japanese natural language inference dataset.
 
 
  ## Usage (Sentence-Transformers)
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+ You can use this model easily with [sentence-transformers](https://www.SBERT.net).
 
  You need [fugashi](https://github.com/polm/fugashi) and [unidic-lite](https://pypi.org/project/unidic-lite/) for tokenization.
 
@@ -72,6 +72,9 @@ We set `tohoku/bert-base-japanese-v2` as the initial value and trained it on the
  | Evaluation steps | 250 |
 
 
+ # Licenses
+ This model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
+
 
  [^1]: When we trained this model, the test data of JGLUE was not released, so we used the dev set of JGLUE as private evaluation data. Therefore, we selected the checkpoint on the train set of JGLUE instead of its dev set.
 
README_JA.md CHANGED
@@ -69,5 +69,7 @@ print(embeddings)
  | Evaluation steps | 250 |
 
 
+ # License
+ This model is licensed under [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
 
  [^1]: When this model was trained, the JGLUE test data had not been released, so we used the JGLUE dev set as private evaluation data. As a knock-on effect, we selected the checkpoint on the JGLUE train set.
tokenizer_config.json CHANGED
@@ -1,22 +1,8 @@
  {
- "cls_token": "[CLS]",
  "do_lower_case": false,
- "do_subword_tokenize": true,
- "do_word_tokenize": true,
- "jumanpp_kwargs": null,
- "mask_token": "[MASK]",
- "mecab_kwargs": {
- "mecab_dic": "unidic_lite"
- },
- "model_max_length": 1000000000000000019884624838656,
- "name_or_path": "cl-tohoku/bert-base-japanese-v2",
- "never_split": null,
- "pad_token": "[PAD]",
- "sep_token": "[SEP]",
- "special_tokens_map_file": null,
+ "word_tokenizer_type": "mecab",
  "subword_tokenizer_type": "wordpiece",
- "sudachi_kwargs": null,
- "tokenizer_class": "BertJapaneseTokenizer",
- "unk_token": "[UNK]",
- "word_tokenizer_type": "mecab"
+ "mecab_kwargs": {
+ "mecab_dic": "unidic_lite"
+ }
  }
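
The `tokenizer_config.json` change is easier to judge at the key level than at the line level. The following is a minimal sketch, not part of the commit, that parses the old and new file contents as they appear in this diff and reports which keys were removed, added, or changed in value. The helper name `summarize_change` is hypothetical, and the sketch only compares JSON. It does not check whether `transformers` loads the tokenizer identically from the trimmed file.

```python
import json

# Old file contents, copied from the removed and context lines of the diff.
OLD_CONFIG = json.loads("""
{
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "do_subword_tokenize": true,
  "do_word_tokenize": true,
  "jumanpp_kwargs": null,
  "mask_token": "[MASK]",
  "mecab_kwargs": {"mecab_dic": "unidic_lite"},
  "model_max_length": 1000000000000000019884624838656,
  "name_or_path": "cl-tohoku/bert-base-japanese-v2",
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "subword_tokenizer_type": "wordpiece",
  "sudachi_kwargs": null,
  "tokenizer_class": "BertJapaneseTokenizer",
  "unk_token": "[UNK]",
  "word_tokenizer_type": "mecab"
}
""")

# New file contents, copied from the added and context lines of the diff.
NEW_CONFIG = json.loads("""
{
  "do_lower_case": false,
  "word_tokenizer_type": "mecab",
  "subword_tokenizer_type": "wordpiece",
  "mecab_kwargs": {"mecab_dic": "unidic_lite"}
}
""")


def summarize_change(old, new):
    """Return (removed, added, changed) top-level keys between two configs."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return removed, added, changed


removed, added, changed = summarize_change(OLD_CONFIG, NEW_CONFIG)
print("removed keys:", removed)
print("added keys:", added)
print("keys with changed values:", changed)
```

Read this way, the commit keeps four keys with their values unchanged and drops the rest, including `tokenizer_class` and the special-token entries. The `+4 -18` line count in the file list reflects reordering and the multi-line `mecab_kwargs` object rather than any new settings.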