from typing import Optional
TASKS_PRETTY = {
"library_based_code_generation": "Library-based code generation",
"ci_builds_repair": "CI builds repair",
"project_code_completion": "Project-level code completion",
"commit_message_generation": "Commit message generation",
"bug_localization": "Bug localization",
"module_summarization": "Module Summarization",
}
TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()}
TASKS_DESCRIPTIONS = {
"library_based_code_generation": """# Library-based code generation\n
Our Library-based code generation benchmark 🤗 [JetBrains-Research/lca-library-based-code-generation](https://huggingface.co/datasets/JetBrains-Research/lca-library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
For evaluation, we use two metrics:
* `ChrF`: textual similarity between the generated code and the reference program.
* `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code.
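Below is a rough, unofficial sketch of how these two metrics could be computed for a single sample, using the Hugging Face `evaluate` package for `ChrF` and a simple AST-based approximation of `API Recall`; the actual evaluation code lives in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
```python
import ast

import evaluate  # pip install evaluate sacrebleu


def extract_call_names(code: str) -> set:
    # Collect the names of all functions and methods called in the code.
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)
            elif isinstance(func, ast.Name):
                names.add(func.id)
    return names


def api_recall(generated: str, reference: str) -> float:
    # Share of the reference program's calls that also appear in the generated code.
    reference_calls = extract_call_names(reference)
    if not reference_calls:
        return 1.0
    return len(reference_calls & extract_call_names(generated)) / len(reference_calls)


generated_code = "print(sum([1, 2, 3]))"
reference_code = "total = sum([1, 2, 3]); print(total)"
chrf_score = evaluate.load("chrf").compute(predictions=[generated_code], references=[[reference_code]])["score"]
print(chrf_score, api_recall(generated_code, reference_code))
```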
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"ci_builds_repair": """# CI builds repair\n
Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
includes 77 manually curated and assessed data points from 32 Python repositories, where the model is asked to fix a failed build.
The benchmark clones the repository to a local directory, the model fixes the issue based on the logs and the local repository state,
and the benchmark then pushes the repository to GitHub and requests the result of the GitHub CI.
We use the `Pass@1` metric to measure CI repair: the ratio of data points for which the build passed successfully after the generated fix was applied.
Models can be evaluated in three settings:
* `full` – **no** ground truth diffs are used for model evaluation;
* `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
* `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue.
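As a minimal illustrative sketch (not the benchmark code itself), `Pass@1` here is simply the fraction of data points whose CI build succeeded after the generated fix was pushed; the `build_passed` field below is illustrative, not the benchmark's actual schema.
```python
def pass_at_1(results: list) -> float:
    # results: one dict per data point with a boolean "build_passed" flag.
    if not results:
        return 0.0
    return sum(1 for result in results if result["build_passed"]) / len(results)


print(pass_at_1([{"build_passed": True}, {"build_passed": False}]))  # 0.5
```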
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"project_code_completion": """# Project-level code completion\n
Our Project-level code completion benchmark 🤗 [JetBrains-Research/lca-project-level-code-completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) includes four sets of samples:
* `small-context`: 144 data points,
* `medium-context`: 224 data points,
* `large-context`: 270 data points,
* `huge-context`: 296 data points.
Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
and a repository snapshot that can be used to build the context.
We use the standard `Exact Match (EM)` metric for one-line code completion.
We evaluate `Exact Match` for different line categories:
* *infile* – functions and classes are from the completion file;
* *inproject* – functions and classes are from the repository snapshot at the moment of completion;
* *committed* – functions and classes are from the files added in the same commit as the completion file;
* *common* – functions and classes with common names, e.g., `main`, `get`, etc.;
* *non-informative* – short/long lines, import/print lines, or comment lines;
* *random* – lines that don't fit any of the previous categories.
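A minimal sketch of how per-category `Exact Match` could be aggregated is shown below; the field names are illustrative, and the official evaluation code lives in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
```python
from collections import defaultdict


def exact_match_by_category(samples: list) -> dict:
    # samples: one dict per completed line with "category", "prediction", and "ground_truth" fields.
    hits, totals = defaultdict(int), defaultdict(int)
    for sample in samples:
        totals[sample["category"]] += 1
        hits[sample["category"]] += int(sample["prediction"].strip() == sample["ground_truth"].strip())
    return {category: hits[category] / totals[category] for category in totals}


print(exact_match_by_category([
    {"category": "infile", "prediction": "return x + 1", "ground_truth": "return x + 1"},
    {"category": "common", "prediction": "return None", "ground_truth": "pass"},
]))  # {'infile': 1.0, 'common': 0.0}
```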
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `project_level_code_completion` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"commit_message_generation": """# Commit message generation\n
Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, for which the model needs to generate commit messages.
We use the following metrics for evaluation:
* [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
* [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)
* [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
* [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
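All four metrics are available through the Hugging Face `evaluate` package; the snippet below is a minimal sketch, not the leaderboard's own evaluation code.
```python
import evaluate  # pip install evaluate sacrebleu rouge_score bert_score

predictions = ["Fix off-by-one error in pagination"]
references = ["Fix an off-by-one error in the pagination logic"]

bleu = evaluate.load("sacrebleu").compute(predictions=predictions, references=[[ref] for ref in references])["score"]
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)  # rouge1, rouge2, rougeL
chrf = evaluate.load("chrf").compute(predictions=predictions, references=[[ref] for ref in references])["score"]
bert_f1 = evaluate.load("bertscore").compute(predictions=predictions, references=references, lang="en")["f1"]
print(bleu, rouge, chrf, bert_f1)
```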
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
**Note.** The leaderboard is sorted by the `ROUGE-1` metric by default.
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"bug_localization": """# Bug localization\n
Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions from Python, Java, and Kotlin projects, with information about the pull requests that fix them.
The model needs to identify the files within the repository that need to be modified to address the reported bug.
We use information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, with `k` equal to 1 and 2.
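A rough sketch of the retrieval metrics for a single bug report is shown below, assuming the model returns a ranked list of candidate file paths (illustrative only; the official implementation is in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines)).
```python
def precision_recall_at_k(ranked_files: list, gold_files: set, k: int) -> tuple:
    # ranked_files: the model's ranking of candidate files; gold_files: files changed by the fixing pull request.
    top_k = ranked_files[:k]
    hits = sum(1 for path in top_k if path in gold_files)
    precision = hits / k
    recall = hits / len(gold_files) if gold_files else 0.0
    return precision, recall


print(precision_recall_at_k(["src/api.py", "src/utils.py", "tests/test_api.py"], {"src/utils.py"}, k=2))  # (0.5, 1.0)
```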
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"module_summarization": """# Module summarization\n
Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing the documentation of open-source permissively licensed Python projects.
The model is required to generate such a description, given the relevant context code and the intent behind the documentation.
We use a novel metric for evaluation:
* `CompScore`: a new metric proposed for this task that uses an LLM as an assessor. Our approach involves feeding the LLM the relevant code and two versions of the documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).
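Purely as a hypothetical illustration of the LLM-as-assessor idea (the `judge_prefers_generated` helper below is made up and not part of any library; the real `CompScore` implementation is described in the link above), a pairwise comparison might look like this:
```python
def comp_score(code: str, gold_doc: str, generated_doc: str, judge_prefers_generated) -> float:
    # Hypothetical sketch: ask an LLM judge which documentation fits the code better,
    # presenting the two candidates in both orders to reduce position bias.
    first = judge_prefers_generated(code, first=generated_doc, second=gold_doc)
    second = judge_prefers_generated(code, first=gold_doc, second=generated_doc)
    return 100.0 * (int(first) + int(second)) / 2
```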
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
}
def get_submission_text_files_for_task(task_pretty: Optional[str]) -> str:
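    """Return Markdown instructions on attaching prediction files for the given task.

    `task_pretty` is a human-readable task name (a value from `TASKS_PRETTY`);
    `None` or an empty string yields a generic prompt asking to select a task first.
    """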
    if not task_pretty:
        return "Please select a specific task to see more detailed instructions regarding submitting files."
    task_id = TASKS_PRETTY_REVERSE[task_pretty]
    if task_id == "commit_message_generation":
        return f"""**{task_pretty} Instructions:**\n\n* Please attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by the 🏟️ Long Code Arena Team in 🤗 [JetBrains-Research/lca-results](https://huggingface.co/datasets/JetBrains-Research/lca-results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example; the rest are optional."""
    return f"**{task_pretty} Instructions:**\n\n* 🚧 There are no instructions for the current task yet."