danielrcardenas committed · Commit 792df68 · verified · 1 Parent(s): 5014a75

Create README.md

Files changed (1): README.md (+80, -0)

README.md ADDED
---
language:
- code
license: apache-2.0
tags:
- code
- gpt2
- generation
datasets:
- "codeparrot/codeparrot-clean"
- "openai_humaneval"
metrics:
- "evaluate-metric/code_eval"
---
# Compatibilized CodeParrot 🦜 (small)

This is the compatibilized version of CodeParrot 🦜 (small), a GPT-2 model (110M parameters) trained to generate Python code.

The compatibilization is based on the [sequential-rationales](https://github.com/keyonvafa/sequential-rationales) process formulated by Vafa et al.
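In rough terms, compatibilization fine-tunes the model on randomly chosen subsets of each context (word dropout) so that the probabilities it assigns to incomplete contexts remain meaningful. The sketch below only illustrates that subsetting step; the actual fine-tuning code lives in the sequential-rationales repository, and `drop_prob` is an illustrative value, not the setting used here.

```python
import random

def word_dropout(context_ids, drop_prob=0.5):
    """Toy word dropout: keep a random subset of the context token ids.

    Compatibilization trains on such subsets so the model behaves sensibly
    when later asked to score partial contexts during rationale extraction.
    """
    kept = [tok for tok in context_ids if random.random() > drop_prob]
    return kept or context_ids[:1]  # never return an empty context

# Example: a hypothetical tokenized prompt with some tokens dropped.
print(word_dropout([4299, 23748, 62, 6894, 33529], drop_prob=0.5))
```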
## Usage

You can load the CodeParrot model and tokenizer directly in `transformers` and use prompts from the Galeras dataset to sample from the model:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
model = AutoModelWithLMHead.from_pretrained("semeru/compatible-codeparrot-small")

# `df_sampled_code` is a pandas DataFrame of Galeras samples with a `prompt`
# column (code context) and a `ground_truth` column (reference completion).
df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']
```
or with a `pipeline`:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="semeru/compatible-codeparrot-small")
outputs = pipe("def hello_world():")
```
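Once the Galeras prompts are tokenized as in the first snippet, completions can be sampled with `model.generate`. The following is a minimal sketch that reuses the hypothetical `df_sampled_code` DataFrame from above; the decoding parameters (`max_new_tokens`, `temperature`) are illustrative defaults, not the settings used in our experiments.

```python
import torch

# Sample one completion per tokenized Galeras prompt (illustrative settings).
model.eval()
completions = []
for input_ids in df_sampled_code['input_ids']:
    inputs = torch.tensor([input_ids])            # batch of size 1
    with torch.no_grad():
        output = model.generate(
            inputs,
            max_new_tokens=128,                   # generation budget (illustrative)
            do_sample=True,
            temperature=0.8,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
        )
    # keep only the newly generated part, not the prompt
    completions.append(tokenizer.decode(output[0, inputs.shape[1]:]))

df_sampled_code['completion'] = completions
```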
## Training

The model was trained on the cleaned [CodeParrot 🦜 dataset](https://huggingface.co/datasets/codeparrot/codeparrot-clean) with the following settings:

| Config | Value |
|--------|-------|
| Batch size | 192 |
| Context size | 1024 |
| Training steps | 150,000 |
| Gradient accumulation | 1 |
| Gradient checkpointing | False |
| Learning rate | 5e-4 |
| Weight decay | 0.1 |
| Warmup steps | 2000 |
| Schedule | Cosine |

The training was executed on 16 x A100 (40GB) GPUs. This setting amounts to roughly 29 billion tokens.
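For orientation only, the table above could be expressed roughly as the `transformers.TrainingArguments` below; the actual run used the CodeParrot research training scripts (see Resources), so the per-device batch size and argument mapping here are assumptions, not the real configuration.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the table above onto TrainingArguments
# (the real training used the CodeParrot scripts, not the Trainer API).
training_args = TrainingArguments(
    output_dir="compatible-codeparrot-small",
    per_device_train_batch_size=12,  # 16 GPUs x 12 = effective batch size of 192
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    learning_rate=5e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    lr_scheduler_type="cosine",
    max_steps=150_000,
)
# The 1024-token context size is applied when the dataset is tokenized,
# not through TrainingArguments.
```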
## Performance

We evaluated the model on OpenAI's [HumanEval](https://huggingface.co/datasets/openai_humaneval) benchmark, which consists of programming challenges:

| Metric | Value |
|--------|-------|
| pass@1 | 3.80% |
| pass@10 | 6.57% |
| pass@100 | 12.78% |

The [pass@k metric](https://huggingface.co/metrics/code_eval) gives the probability that at least one out of k generations passes the unit tests.
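The metric can be computed with the `code_eval` module of the `evaluate` library. The snippet below is a minimal sketch with one toy problem; for the numbers above, the predictions would instead be model generations for the HumanEval prompts, scored with k=[1, 10, 100].

```python
import os
from evaluate import load

# code_eval executes model-generated code, so it must be enabled explicitly.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = load("code_eval")

# Toy example: one problem with two candidate completions.
test_cases = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b", "def add(a, b):\n    return a - b"]]

pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)  # {'pass@1': 0.5, 'pass@2': 1.0}
```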
## Resources

- Dataset: [full](https://huggingface.co/datasets/codeparrot/codeparrot-clean), [train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train), [valid](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid)
- Code: [repository](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot)
- Spaces: [generation](), [highlighting]()