Memorization or Generation of Big Code Models Leaderboard

Inspired by the 🤗 Open LLM Leaderboard and 🤗 Open LLM-Perf Leaderboard 🏋️, we compare the performance of base code generation models on the HumanEval and HumanEval-ET benchmarks. We also report the Memorization-Generalization Index (MGI) and provide information about the models. We compare both open and closed pre-trained code models that people can use as base models for their training.

The leaderboard table lists each model together with its MGI score and Pass@1 on HumanEval and HumanEval-ET, both at temperature 0 and at temperature 0.8.
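For reference, Pass@1 on these benchmarks is usually computed with the unbiased pass@k estimator introduced alongside HumanEval; the sketch below shows that calculation under common settings (the sample counts are illustrative, not leaderboard data, and the exact evaluation configuration is the one described in the harness guide linked below).

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
    c of which pass all unit tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a running product for numerical stability
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# temp = 0: greedy decoding gives one sample per problem, so pass@1 is 0 or 1 per problem.
# temp = 0.8: sample several completions (e.g. n = 50) and average the estimator over problems.
pass_counts = [12, 0, 50, 3]  # illustrative numbers of passing samples out of 50
print(sum(pass_at_k(50, c, 1) for c in pass_counts) / len(pass_counts))
```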


Context

The growing number of code models released by the community calls for a comprehensive evaluation that reliably benchmarks their capabilities. Similar to the 🤗 Open LLM Leaderboard, we selected two common benchmarks for evaluating Code LLMs:

- HumanEval: 164 hand-written Python programming problems, each consisting of a function signature, a docstring, and unit tests that check the functional correctness of the generated code.
- HumanEval-ET: an extended version of HumanEval that adds more test cases per problem, making the correctness check stricter.
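To make the task format concrete, here is an illustrative problem in the HumanEval style (it is not an actual benchmark item): the model is prompted with a signature and docstring, and its completion is executed against hidden unit tests. HumanEval-ET keeps the same prompts but checks more test cases per problem.

```python
# Prompt shown to the model: a function signature plus docstring.
def rolling_best(numbers):
    """Given a list of integers, return a list where element i is the
    largest value seen in numbers[0..i]."""
    # --- a model completion would continue here ---
    best, out = float("-inf"), []
    for x in numbers:
        best = max(best, x)
        out.append(best)
    return out

# Hidden unit tests: the benchmark executes these against the completion,
# and the problem counts as solved only if every assertion passes.
def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([-2, -5, -1]) == [-2, -2, -1]

check(rolling_best)
```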

How to submit models/results to the leaderboard?

We welcome the community to submit evaluation results for new models. These results will be added as non-verified; however, the authors are required to upload their generations so that other members can check them.

1 - Running Evaluation

We wrote a detailed guide for running the evaluation on your model. You can find it in bigcode-evaluation-harness/leaderboard. Running it will generate a JSON file summarizing the results, in addition to the raw generations and metric files.
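Once the run finishes, you can quickly inspect the summary file. This is a minimal sketch that assumes the JSON maps task names to metric dictionaries and that you saved it as evaluation_results.json; adjust the path and keys to whatever the guide's command actually produces.

```python
import json

# Hypothetical filename: point this at the summary JSON written by your evaluation run.
with open("evaluation_results.json") as f:
    results = json.load(f)

# Expected shape (an assumption): {"humaneval": {"pass@1": 0.31, ...}, "config": {...}}
for task, metrics in results.items():
    if isinstance(metrics, dict):
        for name, value in metrics.items():
            print(f"{task}: {name} = {value}")
    else:
        print(f"{task}: {metrics}")
```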

2 - Submitting Results 🚀

To submit your results, create a Pull Request in the community tab to add them under the community_results folder in this repository:

The title of the PR should be [Community Submission] Model: org/model, Username: your_username, replacing org and model with those of the model you evaluated.
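For example, a submission of results for bigcode/starcoder by a user named jane_doe (both names are only placeholders here) would be titled [Community Submission] Model: bigcode/starcoder, Username: jane_doe.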