semeru/compatible-codeparrot-small


danielrcardenas committed on Aug 19, 2024

Commit

792df68

verified

1 Parent(s): 5014a75

Create README.md

Files changed (1)

README.md +80 -0

README.md ADDED

	@@ -0,0 +1,80 @@

+---
+language:
+  - code
+license: apache-2.0
+tags:
+- code
+- gpt2
+- generation
+datasets:
+- "codeparrot/codeparrot-clean"
+- "openai_humaneval"
+metrics:
+- "evaluate-metric/code_eval"
+---
+# Compatibilized CodeParrot 🦜 (small)
+This is the compatibilized version of CodeParrot 🦜 (small), a GPT-2 model (110M parameters) trained to generate Python code.
+The compatibilization is based on the [sequential-rationales](https://github.com/keyonvafa/sequential-rationales) process formulated by Vafa et al.
+## Usage
+You can load the model and tokenizer directly in `transformers` and tokenize samples from the Galeras dataset to prompt the model:
+```Python
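+from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
+# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
+model = AutoModelForCausalLM.from_pretrained("semeru/compatible-codeparrot-small")
+# df_sampled_code is assumed to be an already loaded pandas DataFrame of Galeras samples with 'prompt' and 'ground_truth' columns
+# Token length of each reference solution
+df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
+# Token ids of each prompt
+df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']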
+```
+or with a `pipeline`:
+```Python
+from transformers import pipeline
+pipe = pipeline("text-generation", model="semeru/compatible-codeparrot-small")
+outputs = pipe("def hello_world():")
+```
+## Training
+The model was trained on the cleaned [CodeParrot 🦜 dataset](https://huggingface.co/datasets/codeparrot/codeparrot-clean) with the following settings:
+|Config|Value|
+|-------|-----|
+|Batch size| 192 |
+|Context size| 1024 |
+|Training steps| 150,000 |
+|Gradient accumulation| 1|
+|Gradient checkpointing| False|
+|Learning rate| 5e-4 |
+|Weight decay | 0.1 |
+|Warmup steps| 2000 |
+|Schedule| Cosine |
+Training ran on 16 x A100 (40GB) GPUs. With these settings the model saw roughly 29 billion tokens (192 sequences x 1024 tokens x 150,000 steps).
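As a sanity check on the figures above, the token budget and the learning-rate curve can be sketched in a few lines. The numbers come from the settings table. The schedule shape (linear warmup from zero, then cosine decay to zero over the remaining steps) is an assumption, since the table only says "Cosine" with 2000 warmup steps. `total_tokens` and `lr_at` are hypothetical helper names, not part of any training script.

```python
import math

# Values taken from the settings table
BATCH_SIZE = 192          # sequences per optimizer step
CONTEXT_SIZE = 1024       # tokens per sequence
TRAINING_STEPS = 150_000
GRAD_ACCUMULATION = 1
WARMUP_STEPS = 2_000
PEAK_LR = 5e-4


def total_tokens() -> int:
    """Tokens processed over the whole run."""
    return BATCH_SIZE * GRAD_ACCUMULATION * CONTEXT_SIZE * TRAINING_STEPS


def lr_at(step: int) -> float:
    """Learning rate at a given step.

    Assumed shape: linear warmup from 0 to PEAK_LR, then cosine decay to 0.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TRAINING_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"{total_tokens():,} tokens")  # 29,491,200,000, i.e. roughly 29 billion
    for step in (0, 1_000, 2_000, 76_000, 150_000):
        print(step, lr_at(step))
```

Under this assumed shape the rate reaches its 5e-4 peak at step 2000 and is back to half the peak at step 76,000, the midpoint of the decay phase.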
+## Performance
+We evaluated the model on OpenAI's [HumanEval](https://huggingface.co/datasets/openai_humaneval) benchmark, which consists of programming challenges:
+| Metric | Value |
+|-------|-----|
+|pass@1 | 3.80% |
+|pass@10 | 6.57% |
+|pass@100 | 12.78% |
+The [pass@k metric](https://huggingface.co/metrics/code_eval) estimates the probability that at least one of k generated solutions to a problem passes its unit tests.
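For reference, the unbiased pass@k estimator introduced with HumanEval can be written as below. This is a sketch and not necessarily the exact computation behind the table: `pass_at_k` is an illustrative helper, and the sample counts in the usage note are hypothetical, because the README does not state how many generations were drawn per problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for a single problem.

    n: number of generated samples
    c: number of those samples that pass the unit tests
    k: number of attempts allowed

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a hypothetical 200 samples of which 10 pass, `pass_at_k(200, 10, 1)` is about 0.05. The benchmark score is the mean of this value over all problems, which is what `mean_pass_at_k` computes.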
+## Resources
+- Dataset: [full](https://huggingface.co/datasets/codeparrot/codeparrot-clean), [train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train), [valid](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid)
+- Code: [repository](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot)
+- Spaces: [generation](), [highlighting]()