--- |
|
language: |
|
- code |
|
- en |
|
license: apache-2.0 |
|
tags: |
|
- code |
|
- gpt2 |
|
- generation |
|
datasets: |
|
- codeparrot/codeparrot-clean |
|
- openai_humaneval |
|
- semeru/code-text-python |
|
- semeru/galeras-causal4se-3k-levenshtein |
|
metrics: |
|
- evaluate-metric/code_eval |
|
--- |
|
|
|
# Compatibilized CodeParrot 🦜 (small)
|
|
|
This is the compatibilized version of CodeParrot 🦜, a GPT-2 model (110M parameters) trained to generate Python code.
|
|
|
The compatibilization is based on the [sequential-rationales](https://github.com/keyonvafa/sequential-rationales) process formulated by Vafa et al.
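In that work, a language model is made "compatible" with incomplete contexts by fine-tuning with word dropout, which in turn enables greedy rationalization at inference time. As a rough sketch of the idea (the dropout rate and helper below are illustrative, not the repository's exact procedure):

```python
import torch

def word_dropout(input_ids: torch.Tensor, rate: float = 0.5) -> torch.Tensor:
    # Illustrative only: keep each context token with probability (1 - rate) so the
    # model learns to assign sensible probabilities to incomplete contexts.
    keep = torch.rand(input_ids.shape[0]) >= rate
    keep[-1] = True  # keep the final position so a next-token target remains
    return input_ids[keep]
```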
|
|
|
## Usage |
|
|
|
You can load the CodeParrot model and tokenizer directly with `transformers` and use the Galeras dataset for sampling from the model:
|
|
|
```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
model = AutoModelForCausalLM.from_pretrained("semeru/compatible-codeparrot-small")

# Load the Galeras dataset into a pandas DataFrame (the split name is an assumption).
df_sampled_code = load_dataset("semeru/galeras-causal4se-3k-levenshtein", split="train").to_pandas()

# Token length of each ground-truth snippet, and tokenized prompts for sampling.
df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']
```
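Once loaded, the model can be sampled like any causal LM in `transformers`; the prompt and generation settings below are illustrative:

```python
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0]))
```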
|
|
|
## Training |
|
|
|
The model was trained on the cleaned [CodeParrot 🦜 dataset](https://huggingface.co/datasets/codeparrot/codeparrot-clean) with the following settings:
|
|
|
|Config|Value|
|-------|-----|
|Batch size| 192 |
|Context size| 1024 |
|Training steps| 150,000 |
|Gradient accumulation| 1 |
|Gradient checkpointing| False |
|Learning rate| 5e-4 |
|Weight decay| 0.1 |
|Warmup steps| 2000 |
|Schedule| Cosine |
|
|
|
The training was executed on 16 × A100 (40GB) GPUs. This setting amounts to roughly 29 billion tokens (192 batch size × 1024 context length × 150,000 steps ≈ 2.9 × 10¹⁰).
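For orientation, the configuration above maps roughly onto `transformers`' `TrainingArguments` as sketched below; the model was actually trained with the CodeParrot research scripts, so this is an illustration rather than the original setup:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="codeparrot-small",
    per_device_train_batch_size=12,  # 12 per GPU x 16 GPUs = 192 effective batch size
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    learning_rate=5e-4,
    weight_decay=0.1,
    warmup_steps=2000,
    lr_scheduler_type="cosine",
    max_steps=150_000,
)
```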
|
|
|
## Performance |
|
|
|
We evaluated the model on OpenAI's [HumanEval](https://huggingface.co/datasets/openai_humaneval) benchmark, which consists of programming challenges:
|
|
|
| Metric | Value |
|-------|-----|
| pass@1 | 3.80% |
| pass@10 | 6.57% |
| pass@100 | 12.78% |
|
|
|
The [pass@k metric](https://huggingface.co/metrics/code_eval) gives the probability that at least one out of k generations passes the tests.
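For reference, pass@k can be computed with the linked `code_eval` metric from the `evaluate` library; the toy problem below is illustrative:

```python
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval executes generated code; opt in explicitly

from evaluate import load

code_eval = load("code_eval")
tests = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b", "def add(a, b):\n    return a * b"]]
pass_at_k, results = code_eval.compute(references=tests, predictions=candidates, k=[1, 2])
print(pass_at_k)  # {'pass@1': 0.5, 'pass@2': 1.0}
```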
|
|
|
## Resources |
|
|
|
- Dataset: [full](https://huggingface.co/datasets/codeparrot/codeparrot-clean), [train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train), [valid](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid) |
|
- Code: [repository](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot) |
|
- Spaces: [generation](), [highlighting]() |