Inspired by the 🤗 Open LLM Leaderboard and the 🤗 Open LLM-Perf Leaderboard 🏋️, we compare the performance of base code generation models on the HumanEval and HumanEval-ET benchmarks. We also report the Memorization-Generalization Index (MGI) and provide additional information about the models. We compare both open and closed pre-trained code models that people can use as base models for further training.
| Model | MGI | HumanEval Pass@1 (temp=0) | HumanEval-ET Pass@1 (temp=0) | HumanEval Pass@1 (temp=0.8) | HumanEval-ET Pass@1 (temp=0.8) |
|---|---|---|---|---|---|
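Pass@1 at temperature 0 is typically computed from a single greedy generation per problem, while at temperature 0.8 several samples are drawn and Pass@1 is estimated with the unbiased estimator from the Codex paper (Chen et al., 2021). Below is a minimal sketch of that estimator; the sample counts in the example are illustrative and not necessarily the settings used for this leaderboard.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total number of generations sampled for a problem
    c: number of generations that pass all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers only: 200 samples for one problem, 37 pass the tests.
print(pass_at_k(n=200, c=37, k=1))  # ~0.185, i.e. c/n for k=1
```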
Notes
The growing number of code models released by the community calls for a comprehensive evaluation that reliably benchmarks their capabilities. Similar to the 🤗 Open LLM Leaderboard, we selected two common benchmarks for evaluating Code LLMs: HumanEval and HumanEval-ET.
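For context, the sketch below loads HumanEval with the `datasets` library to show the fields an evaluation consumes. It assumes the standard `openai_humaneval` dataset id and the field names from its public dataset card; HumanEval-ET is distributed separately and is not loaded here.

```python
from datasets import load_dataset

# HumanEval: 164 hand-written Python problems, each with unit tests.
humaneval = load_dataset("openai_humaneval", split="test")

problem = humaneval[0]
print(problem["task_id"])      # e.g. "HumanEval/0"
print(problem["prompt"])       # function signature + docstring fed to the model
print(problem["entry_point"])  # name of the function under test
print(problem["test"])         # unit tests executed against the completion
```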
We welcome the community to submit evaluation results for new models. These results will be marked as non-verified; however, authors are required to upload their generations so that other members can check them.
We wrote a detailed guide for running the evaluation on your model. You can find it in `bigcode-evaluation-harness/leaderboard`. Running the evaluation will generate a JSON file summarizing the results, along with the raw generations and metric files.
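As an illustration, the summary file can be inspected with a few lines of Python before submitting. The file name `results.json` and the exact schema are assumptions here, since they depend on the harness version and the tasks you ran.

```python
import json

# Assumed file name: adjust to the summary file produced by your run.
with open("results.json") as f:
    results = json.load(f)

# Print whatever task-level metrics were recorded; the keys and nesting
# depend on the harness version and the tasks that were evaluated.
for task, metrics in results.items():
    if isinstance(metrics, dict):
        print(task, metrics)
```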
To submit your results, create a Pull Request in the community tab to add them under the folder `community_results` in this repository:
The title of the PR should be `[Community Submission] Model: org/model, Username: your_username`, replacing `org` and `model` with those of the model you evaluated.