File size: 4,199 Bytes
6a8c5e3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
---
language:
- en
tags:
- code
- autocomplete
- pytorch
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
widget:
  - text: import torch.nn as
---

# GPT2 for Code AutoComplete Model
code-autocomplete, a code completion plugin for Python.

**code-autocomplete** can automatically complete the code of lines and blocks with GPT2.

## Usage

Open source repo:[code-autocomplete](https://github.com/shibing624/code-autocomplete),support GPT2 model, usage:

```python
from autocomplete.gpt2_coder import GPT2Coder

m = GPT2Coder("shibing624/code-autocomplete-distilgpt2-python")
print(m.generate('import torch.nn as')[0])
```

Also, use huggingface/transformers:

*Please use 'GPT2' related functions to load this model!*

```python
import os
from transformers import GPT2Tokenizer, GPT2LMHeadModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

tokenizer = GPT2Tokenizer.from_pretrained("shibing624/code-autocomplete-distilgpt2-python")
model = GPT2LMHeadModel.from_pretrained("shibing624/code-autocomplete-distilgpt2-python")

prompts = [
    """from torch import nn
    class LSTM(Module):
        def __init__(self, *,
                     n_tokens: int,
                     embedding_size: int,
                     hidden_size: int,
                     n_layers: int):""",
    """import numpy as np
    import torch
    import torch.nn as""",
    "import java.util.ArrayList",
    "def factorial(n):",
]
for prompt in prompts:
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt')
    outputs = model.generate(input_ids=input_ids,
                             max_length=64 + len(prompt),
                             temperature=1.0,
                             top_k=50,
                             top_p=0.95,
                             repetition_penalty=1.0,
                             do_sample=True,
                             num_return_sequences=1,
                             length_penalty=2.0,
                             early_stopping=True)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(decoded)
    print("=" * 20)
```

output:
```shell
from torch import nn
    class LSTM(Module):
        def __init__(self, *,
                     n_tokens: int,
                     embedding_size: int,
                     hidden_size: int,
                     n_layers: int):
            self.embedding_size = embedding_size
====================
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
```

Model files:
```
code-autocomplete-distilgpt2-python
├── config.json
├── merges.txt
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.json
```

### Train data
#### pytorch_awesome projects source code

download [code-autocomplete](https://github.com/shibing624/code-autocomplete),
```shell
cd autocomplete
python create_dataset.py
```

If you want train code-autocomplete GPT2 model,refer [https://github.com/shibing624/code-autocomplete/blob/main/autocomplete/gpt2_coder.py](https://github.com/shibing624/code-autocomplete/blob/main/autocomplete/gpt2_coder.py)


### About GPT2

Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large

Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).

Disclaimer: The team releasing GPT-2 also wrote a
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card
has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.


## Citation

```latex
@misc{code-autocomplete,
  author = {Xu Ming},
  title = {code-autocomplete: Code AutoComplete with GPT model},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/shibing624/code-autocomplete},
}
```