Nobuhiro Ueda committed
Commit: 6935ff9
Parent: 579dc96

add README.md

Files changed (1): README.md (+60, -0)
 
---
language: ja
license: cc-by-sa-4.0
datasets:
- wikipedia
- cc100
mask_token: "[MASK]"
widget:
- text: "京都大学で自然言語処理を [MASK] する。"
---

# ku-nlp/roberta-large-japanese-char-wwm

## Model description

This is a Japanese RoBERTa large model pre-trained on Japanese Wikipedia and the Japanese portion of CC-100.
The model was trained with character-level tokenization and whole word masking.

## How to use

You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the character-level tokenizer and the masked language model.
tokenizer = AutoTokenizer.from_pretrained("ku-nlp/roberta-large-japanese-char-wwm")
model = AutoModelForMaskedLM.from_pretrained("ku-nlp/roberta-large-japanese-char-wwm")

# Feed a raw sentence containing the [MASK] token; no pre-tokenization is needed.
sentence = '京都大学で自然言語処理を [MASK] する。'
encoding = tokenizer(sentence, return_tensors='pt')
...
```
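The `...` above is left open in the card. One possible continuation, shown here as a rough sketch rather than the card's own code (the use of PyTorch and the top-5 cutoff are illustrative choices), is to score the `[MASK]` position and decode the best candidate characters:

```python
import torch

# Continues from the snippet above: `tokenizer`, `model`, and `encoding` are already defined.
with torch.no_grad():
    logits = model(**encoding).logits

# Locate the [MASK] token and take the highest-scoring vocabulary entries at that position.
mask_positions = (encoding["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = torch.topk(logits[0, mask_positions[0]], k=5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```

Because the tokenizer is character-level, a single `[MASK]` corresponds to exactly one character in the filled-in text.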

You can fine-tune this model on downstream tasks.
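For instance, a text-classification fine-tune would load the same checkpoint with a task-specific head. The sketch below is generic and not part of the original card; the example sentence, `num_labels=2`, and the training setup are placeholders:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ku-nlp/roberta-large-japanese-char-wwm")
# A freshly initialized classification head is added on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "ku-nlp/roberta-large-japanese-char-wwm", num_labels=2  # hypothetical binary task
)

# Raw Japanese text can be encoded directly.
batch = tokenizer(["この映画は面白かった。"], truncation=True, max_length=512, return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)  # torch.Size([1, 2]); train with Trainer or a custom loop
```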

## Tokenization

There is no need to tokenize texts in advance, and you can give raw texts to the tokenizer.
The texts are tokenized into character-level tokens by [sentencepiece](https://github.com/google/sentencepiece).
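As a quick illustration (not taken from the original card), tokenizing a raw sentence should yield roughly one token per character; the exact surface forms depend on the underlying sentencepiece model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ku-nlp/roberta-large-japanese-char-wwm")

# Raw text goes straight in; no morphological analyzer or pre-segmentation is required.
tokens = tokenizer.tokenize("京都大学で自然言語処理を研究する。")
print(tokens)  # expected: character-level tokens such as '京', '都', '大', '学', ...
```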

## Vocabulary

The vocabulary consists of 18,377 tokens including all characters that appear in the training corpus.
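If you want to confirm this programmatically, a quick check (not part of the original card; whether added special tokens are counted depends on the tokenizer attribute used) is:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ku-nlp/roberta-large-japanese-char-wwm")
print(tokenizer.vocab_size)  # expected to match the 18,377 figure above (up to added special tokens)
```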

## Training procedure

This model was trained on Japanese Wikipedia (as of 20220220) and the Japanese portion of CC-100. Training took a month on 8 to 16 NVIDIA A100 GPUs.

The following hyperparameters were used during pre-training:

- learning_rate: 5e-5
- per_device_train_batch_size: 38
- distributed_type: multi-GPU
- num_devices: 16
- gradient_accumulation_steps: 8
- total_train_batch_size: 4864
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup
- training_steps: 500000
- warmup_steps: 10000
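As a consistency check on the values above (an observation, not an additional reported setting), the effective batch size follows from the per-device settings:

total_train_batch_size = per_device_train_batch_size × num_devices × gradient_accumulation_steps = 38 × 16 × 8 = 4864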