⭐ Memorization or Generation of Big Code Models Leaderboard

Inspired from the 🤗 Open LLM Leaderboard and 🤗 Open LLM-Perf Leaderboard 🏋️, we compare performance of base code generation models on HumanEval and HumanEval-ET benchamrk. We also measure Memorization-Generalization Index and provide information about the models. We compare both open and closed pre-trained code models, that people can start from as base models for their trainings.

Model	MGI	Pass@1(temp=0)		Pass@1(temp=0.8)
Model	MGI	HumanEval	HumanEval-ET	HumanEval	HumanEval-ET

Notes

MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper. A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.

Context

The growing number of code models released by the community necessitates a comprehensive evaluation to reliably benchmark their capabilities. Similar to the 🤗 Open LLM Leaderboard, we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:

HumanEval - benchmark for measuring functional correctness for synthesizing programs from docstrings. It consists of 164 Python programming problems.
HumanEval-ET.
MGI - In addition to these benchmarks, we also measure Memorization-Generalization Index.

How to submit models/results to the leaderboard?

We welcome the community to submit evaluation results of new models. These results will be added as non-verified, the authors are however required to upload their generations in case other members want to check.

1 - Running Evaluation

We wrote a detailed guide for running the evaluation on your model. You can find the it in bigcode-evaluation-harness/leaderboard. This will generate a json file summarizing the results, in addition to the raw generations and metric files.

2- Submitting Results 🚀

To submit your results create a Pull Request in the community tab to add them under the folder community_results in this repository:

Create a folder called ORG_MODELNAME_USERNAME for example bigcode_starcoder_loubnabnl
Put your json file with grouped scores from the guide, in addition generations folder and metrics folder in it.

The title of the PR should be [Community Submission] Model: org/model, Username: your_username, replace org and model with those corresponding to the model you evaluated.