karthigakannaiyan committed on
Commit f016346 · verified · 1 Parent(s): 6258eab

Upload 9 files

Files changed (9)
  1. README.md +98 -14
  2. added_tokens.json +1 -0
  3. config.json +43 -0
  4. final.py +50 -0
  5. merges.txt +0 -0
  6. requirements.txt +4 -0
  7. special_tokens_map.json +147 -0
  8. tokenizer_config.json +62 -0
  9. vocab.json +0 -0
README.md CHANGED
@@ -1,14 +1,98 @@
- ---
- title: Code Comment Generator Using DL
- emoji: 📊
- colorFrom: purple
- colorTo: pink
- sdk: gradio
- sdk_version: 5.35.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: Gradio-based Code Comment Generator app
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ license: bsd-3-clause
+ tags:
+ - codet5
+ datasets:
+ - code_search_net
+ inference: true
+ ---
+
+ # CodeT5-base for Code Summarization
+
+ [CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data in a multi-lingual training setting
+ (Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in the EMNLP 2021 paper
+ [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
+ by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. Please see [this repository](https://github.com/salesforce/CodeT5) for more details.
+
+ ## How to use
+
+ Here is how to use this model:
+
+ ```python
+ from transformers import RobertaTokenizer, T5ForConditionalGeneration
+
+ if __name__ == '__main__':
+     tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
+     model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
+
+     text = """def svg_to_image(string, size=None):
+     if isinstance(string, unicode):
+         string = string.encode('utf-8')
+     renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
+     if not renderer.isValid():
+         raise ValueError('Invalid SVG data.')
+     if size is None:
+         size = renderer.defaultSize()
+     image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
+     painter = QtGui.QPainter(image)
+     renderer.render(painter)
+     return image"""
+
+     input_ids = tokenizer(text, return_tensors="pt").input_ids
+
+     generated_ids = model.generate(input_ids, max_length=20)
+     print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
+     # this prints: "Convert a SVG string to a QImage."
+ ```
+
+ ## Fine-tuning data
+
+ We employ the filtered version of CodeSearchNet data [[Husain et al., 2019](https://arxiv.org/abs/1909.09436)]
+ from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on
+ code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can
+ prepare text (or code) for the model using RobertaTokenizer with the vocab files from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
+
+ ### Data statistics
+
+ | Programming Language | Training | Dev    | Test   |
+ | :------------------- | :------: | :----: | :----: |
+ | Python               | 251,820  | 13,914 | 14,918 |
+ | PHP                  | 241,241  | 12,982 | 14,014 |
+ | Go                   | 167,288  | 7,325  | 8,122  |
+ | Java                 | 164,923  | 5,183  | 10,955 |
+ | JavaScript           | 58,025   | 3,885  | 3,291  |
+ | Ruby                 | 24,927   | 1,400  | 1,261  |
+
+ ## Training procedure
+
+ We fine-tune codet5-base on these six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) in a multi-task learning setting. We employ
+ balanced sampling to avoid biasing towards high-resource tasks. Please refer to the [paper](https://arxiv.org/abs/2109.00859) for more details.
+
+ ## Evaluation results
+
+ Unlike the paper, which allows selecting a different best checkpoint for each programming language (PL), here we employ one checkpoint for
+ all PLs. In addition, we remove the task-control prefix that specifies the PL in training and inference. The results on the test set are shown below:
+
+ | Model | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
+ | ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
+ | Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 |
+ | Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 |
+ | [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) | 11.17 | 11.90 | 17.72 | 18.14 | 16.47 | 24.02 | 16.57 |
+ | [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
+ | [PLBART](https://aclanthology.org/2021.naacl-main.211.pdf) | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
+ | [CodeT5-small](https://arxiv.org/abs/2109.00859) | 14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 |
+ | [CodeT5-base](https://arxiv.org/abs/2109.00859) | **15.24** | 16.16 | 19.56 | 20.01 | **20.31** | 26.03 | 19.55 |
+ | [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | **15.24** | **16.18** | **19.95** | **20.42** | 20.26 | **26.10** | **19.69** |
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{wang2021codet5,
+     title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
+     author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C.H.},
+     booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
+     year={2021},
+ }
+ ```
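
The README's note that inputs are prepared with the pre-trained code-specific BPE tokenizer can be checked directly; below is a minimal sketch (the toy `add` function is illustrative, not from the commit):

```python
from transformers import RobertaTokenizer

# Load the same BPE tokenizer the model card refers to.
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")

# Inspect the subword pieces the code-specific BPE vocab produces for a toy function.
code = "def add(a, b):\n    return a + b"
print(tokenizer.tokenize(code))
print(tokenizer(code, return_tensors="pt").input_ids.shape)
```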
added_tokens.json ADDED
@@ -0,0 +1 @@
+ {}
config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "architectures": [
+     "T5ForConditionalGeneration"
+   ],
+   "bos_token_id": 1,
+   "d_ff": 3072,
+   "d_kv": 64,
+   "d_model": 768,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 2,
+   "feed_forward_proj": "relu",
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "t5",
+   "n_positions": 512,
+   "num_decoder_layers": 12,
+   "num_heads": 12,
+   "num_layers": 12,
+   "output_past": true,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "task_specific_params": {
+     "summarization": {
+       "early_stopping": true,
+       "length_penalty": 2.0,
+       "max_length": 256,
+       "min_length": 1,
+       "no_repeat_ngram_size": 3,
+       "num_beams": 5
+     }
+   },
+   "transformers_version": "4.5.0",
+   "use_cache": true,
+   "vocab_size": 32100
+ }
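
The `task_specific_params.summarization` block in this config mirrors standard `generate()` keyword arguments; a minimal sketch of reusing them manually (loading from the Hub checkpoint, since this commit does not include model weights):

```python
from transformers import RobertaTokenizer, T5Config, T5ForConditionalGeneration

# Pull the generation defaults stored in config.json.
config = T5Config.from_pretrained("Salesforce/codet5-base-multi-sum")
gen_kwargs = config.task_specific_params["summarization"]  # num_beams=5, max_length=256, ...

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")

input_ids = tokenizer("def f(x): return x * 2", return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, **gen_kwargs)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```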
final.py ADDED
@@ -0,0 +1,50 @@
+ import gradio as gr
+ import torch
+ from transformers import RobertaTokenizer, T5ForConditionalGeneration
+ from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
+ import nltk
+
+ nltk.download('punkt')
+
+ # Load model and tokenizer
+ model_dir = "./codet5-base-multi-sum"
+ tokenizer = RobertaTokenizer.from_pretrained(model_dir)
+ model = T5ForConditionalGeneration.from_pretrained(model_dir)
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ def generate_comment(code_snippet, reference_comment):
+     # Add prefix for summarization task
+     prefixed_code = "summarize: " + code_snippet.strip()
+     input_ids = tokenizer(prefixed_code, return_tensors="pt").input_ids.to(device)
+     generated_ids = model.generate(input_ids, max_length=64, num_beams=4, early_stopping=True)
+     comment = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
+
+     # Tokenize and compute BLEU against user-provided reference
+     if reference_comment.strip():
+         ref_tokens = nltk.word_tokenize(reference_comment.lower())
+         hyp_tokens = nltk.word_tokenize(comment.lower())
+         bleu = sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=SmoothingFunction().method1)
+         bleu = round(bleu, 2)
+     else:
+         bleu = "N/A (No reference provided)"
+
+     return comment, bleu
+
+
+ # Gradio UI
+ iface = gr.Interface(
+     fn=generate_comment,
+     inputs=[
+         gr.Textbox(label="Enter Code Snippet", lines=4, placeholder="Paste your code here..."),
+         gr.Textbox(label="Reference Comment (optional)", placeholder="Expected comment to compare BLEU score"),
+     ],
+     outputs=[
+         gr.Textbox(label="Generated Comment"),
+         gr.Textbox(label="BLEU Score"),
+     ],
+     title="Code Comment Generator using CodeT5",
+     description="Paste code and get a generated comment with BLEU score (optional reference)."
+ )
+
+ iface.launch()
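
Because `iface.launch()` runs at module level, the scoring logic in `final.py` is easiest to exercise by calling `generate_comment` directly; a hypothetical smoke test (the sample snippet and reference string are illustrative):

```python
# Hypothetical smoke test: guard iface.launch() with `if __name__ == "__main__":`
# in final.py first, otherwise importing it starts the Gradio UI.
from final import generate_comment

comment, bleu = generate_comment(
    "def add(a, b):\n    return a + b",
    "Add two numbers and return the result.",
)
print("comment:", comment)  # model-generated summary of the snippet
print("bleu:", bleu)        # smoothed sentence-level BLEU vs. the reference
```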
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ transformers
+ gradio
+ torch
+ nltk
special_tokens_map.json ADDED
@@ -0,0 +1,147 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true
+   },
+   "eos_token": {
+     "content": "</s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true
+   },
+   "sep_token": {
+     "content": "</s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true
+   },
+   "cls_token": {
+     "content": "<s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true
+   },
+   "mask_token": { "content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+   "additional_special_tokens": [
+     { "content": "<extra_id_99>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_98>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_97>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_96>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_95>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_94>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_93>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_92>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_91>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_90>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_89>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_88>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_87>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_86>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_85>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_84>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_83>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_82>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_81>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_80>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_79>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_78>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_77>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_76>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_75>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_74>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_73>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_72>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_71>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_70>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_69>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_68>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_67>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_66>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_65>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_64>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_63>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_62>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_61>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_60>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_59>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_58>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_57>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_56>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_55>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_54>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_53>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_52>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_51>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_50>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_49>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_48>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_47>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_46>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_45>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_44>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_43>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_42>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_41>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_40>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_39>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_38>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_37>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_36>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_35>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_34>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_33>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_32>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_31>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_30>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_29>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_28>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_27>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_26>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_25>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_24>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_23>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_22>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_21>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_20>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_19>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_18>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_17>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_16>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_15>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_14>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_13>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_12>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_11>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_10>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_9>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_8>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_7>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_6>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_5>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_4>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_3>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_2>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_1>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true },
+     { "content": "<extra_id_0>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true }
+   ]
+ }
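
The hundred `<extra_id_*>` entries above are the T5-style sentinel tokens used for span masking during pre-training; a small sketch of how they behave at the tokenizer level (the masked snippet is illustrative, assuming the Hub checkpoint):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")

# Sentinels mark masked spans: <extra_id_0> for the first span, <extra_id_1> for the next, ...
masked = "def <extra_id_0>(a, b):\n    return a <extra_id_1> b"
ids = tokenizer(masked).input_ids
print(tokenizer.convert_ids_to_tokens(ids))  # each sentinel survives as a single token
```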
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
+ {
+   "errors": "replace",
+   "unk_token": {
+     "content": "<unk>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true,
+     "__type": "AddedToken"
+   },
+   "bos_token": {
+     "content": "<s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true,
+     "__type": "AddedToken"
+   },
+   "eos_token": {
+     "content": "</s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true,
+     "__type": "AddedToken"
+   },
+   "add_prefix_space": false,
+   "sep_token": {
+     "content": "</s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true,
+     "__type": "AddedToken"
+   },
+   "cls_token": {
+     "content": "<s>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true,
+     "__type": "AddedToken"
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "single_word": false,
+     "lstrip": false,
+     "rstrip": false,
+     "normalized": true,
+     "__type": "AddedToken"
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "single_word": false,
+     "lstrip": true,
+     "rstrip": false,
+     "normalized": true,
+     "__type": "AddedToken"
+   },
+   "model_max_length": 512,
+   "tokenizer_class": "RobertaTokenizer"
+ }
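
Given the `model_max_length` of 512 declared above, long inputs should be truncated explicitly when tokenizing; a minimal sketch (the oversized input is illustrative):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")

# Truncate to the 512-token limit from tokenizer_config.json.
long_code = "x = 1\n" * 2000
enc = tokenizer(long_code, truncation=True, max_length=512, return_tensors="pt")
print(enc.input_ids.shape)  # torch.Size([1, 512])
```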
vocab.json ADDED
The diff for this file is too large to render. See raw diff