from typing import Optional
TASKS_PRETTY = {
"library_based_code_generation": "Library-based code generation",
"ci_builds_repair": "CI builds repair",
"project_code_completion": "Project-level code completion",
"commit_message_generation": "Commit message generation",
"bug_localization": "Bug localization",
"module_summarization": "Module Summarization",
}
TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()}
TASKS_DESCRIPTIONS = {
"library_based_code_generation": """# Library-based code generation\n
Our Library-based code generation benchmark 🤗 [JetBrains-Research/lca-library-based-code-generation](https://huggingface.co/datasets/JetBrains-Research/lca-library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
For evaluation, we use two metrics:
* `ChrF`: textual similarity between the generated code and the reference program.
* `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code.
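Below is a rough, unofficial sketch of how these two metrics could be computed for a single sample, using the Hugging Face `evaluate` package for `ChrF` and a simple AST-based approximation of `API Recall`; the actual evaluation code lives in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
```python
import ast

import evaluate  # pip install evaluate sacrebleu


def extract_call_names(code: str) -> set:
    # Collect the names of all functions and methods called in the code.
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)
            elif isinstance(func, ast.Name):
                names.add(func.id)
    return names


def api_recall(generated: str, reference: str) -> float:
    # Share of the reference program's calls that also appear in the generated code.
    reference_calls = extract_call_names(reference)
    if not reference_calls:
        return 1.0
    return len(reference_calls & extract_call_names(generated)) / len(reference_calls)


generated_code = "print(sum([1, 2, 3]))"
reference_code = "total = sum([1, 2, 3]); print(total)"
chrf_score = evaluate.load("chrf").compute(predictions=[generated_code], references=[[reference_code]])["score"]
print(chrf_score, api_recall(generated_code, reference_code))
```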
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `library_based_code_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"ci_builds_repair": """# CI builds repair\n
Our CI builds repair benchmark 🤗 [JetBrains-Research/lca-ci-builds-repair](https://huggingface.co/datasets/JetBrains-Research/lca-ci-builds-repair)
includes 77 manually curated and assessed data points from 32 Python repositories, where the model is asked to fix a failed build.
The benchmark clones the repository to a local directory, the model fixes the issue based on the logs and the local repository state,
and the benchmark then pushes the repository to GitHub and requests the result of the GitHub CI.
We use the `Pass@1` metric to measure CI repair: the ratio of data points for which the build passed successfully after the generated fix was applied.
Models can be evaluated in three settings:
* `full` – **no** ground truth diffs are used for model evaluation;
* `oracle: files` – ground truth diffs are used to select files that should be corrected to fix the issue;
* `oracle: files, lines` – ground truth diffs are used to select files and code blocks that should be corrected to fix the issue.
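As a minimal illustrative sketch (not the benchmark code itself), `Pass@1` here is simply the fraction of data points whose CI build succeeded after the generated fix was pushed; the `build_passed` field below is illustrative, not the benchmark's actual schema.
```python
def pass_at_1(results: list) -> float:
    # results: one dict per data point with a boolean "build_passed" flag.
    if not results:
        return 0.0
    return sum(1 for result in results if result["build_passed"]) / len(results)


print(pass_at_1([{"build_passed": True}, {"build_passed": False}]))  # 0.5
```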
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `ci-builds-repair` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"project_code_completion": """# Project-level code completion\n
Our Project-level code completion benchmark 🤗 [JetBrains-Research/lca-project-level-code-completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) includes four sets of samples:
* `small-context`: 144 data points,
* `medium-context`: 224 data points,
* `large-context`: 270 data points,
* `huge-context`: 296 data points.
Each data point contains the file for completion, a list of lines to complete with their categories (see the categorization below),
and a repository snapshot that can be used to build the context.
We use the standard `Exact Match (EM)` metric for one-line code completion.
We evaluate `Exact Match` for different line categories:
* *infile* – functions and classes are from the completion file;
* *inproject* – functions and classes are from the repository snapshot at the moment of completion;
* *committed* – functions and classes are from the files added in the same commit as the completion file;
* *common* – functions and classes with common names, e.g., `main`, `get`, etc.;
* *non-informative* – short/long lines, import/print lines, or comment lines;
* *random* – lines that don't fit any of the previous categories.
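A minimal sketch of how per-category `Exact Match` could be aggregated is shown below; the field names are illustrative, and the official evaluation code lives in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
```python
from collections import defaultdict


def exact_match_by_category(samples: list) -> dict:
    # samples: one dict per completed line with "category", "prediction", and "ground_truth" fields.
    hits, totals = defaultdict(int), defaultdict(int)
    for sample in samples:
        totals[sample["category"]] += 1
        hits[sample["category"]] += int(sample["prediction"].strip() == sample["ground_truth"].strip())
    return {category: hits[category] / totals[category] for category in totals}


print(exact_match_by_category([
    {"category": "infile", "prediction": "return x + 1", "ground_truth": "return x + 1"},
    {"category": "common", "prediction": "return None", "ground_truth": "pass"},
]))  # {'infile': 1.0, 'common': 0.0}
```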
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `project_level_code_completion` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"commit_message_generation": """# Commit message generation\n
Our Commit message generation benchmark 🤗 [JetBrains-Research/lca-commit-message-generation](https://huggingface.co/datasets/JetBrains-Research/lca-commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, for which the model needs to generate commit messages.
We use the following metrics for evaluation:
* [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
* [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)
* [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
* [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
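All four metrics are available through the Hugging Face `evaluate` package; the snippet below is a minimal sketch, not the leaderboard's own evaluation code.
```python
import evaluate  # pip install evaluate sacrebleu rouge_score bert_score

predictions = ["Fix off-by-one error in pagination"]
references = ["Fix an off-by-one error in the pagination logic"]

bleu = evaluate.load("sacrebleu").compute(predictions=predictions, references=[[ref] for ref in references])["score"]
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)  # rouge1, rouge2, rougeL
chrf = evaluate.load("chrf").compute(predictions=predictions, references=[[ref] for ref in references])["score"]
bert_f1 = evaluate.load("bertscore").compute(predictions=predictions, references=references, lang="en")["f1"]
print(bleu, rouge, chrf, bert_f1)
```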
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `commit_message_generation` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
**Note.** The leaderboard is sorted by the `ROUGE-1` metric by default.
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"bug_localization": """# Bug localization\n
Our Bug localization benchmark 🤗 [JetBrains-Research/lca-bug-localization](https://huggingface.co/datasets/JetBrains-Research/lca-bug-localization) includes 150 manually verified bug issue descriptions from Python, Java, and Kotlin projects, with information about the pull requests that fix them.
The model needs to identify the files within the repository that need to be modified to address the reported bug.
We use information retrieval metrics such as `R@k`, `P@k`, `F1-score`, and `MAP` for evaluation, with `k` equal to 1 and 2.
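A rough sketch of the retrieval metrics for a single bug report is shown below, assuming the model returns a ranked list of candidate file paths (illustrative only; the official implementation is in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines)).
```python
def precision_recall_at_k(ranked_files: list, gold_files: set, k: int) -> tuple:
    # ranked_files: the model's ranking of candidate files; gold_files: files changed by the fixing pull request.
    top_k = ranked_files[:k]
    hits = sum(1 for path in top_k if path in gold_files)
    precision = hits / k
    recall = hits / len(gold_files) if gold_files else 0.0
    return precision, recall


print(precision_recall_at_k(["src/api.py", "src/utils.py", "tests/test_api.py"], {"src/utils.py"}, k=2))  # (0.5, 1.0)
```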
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `bug_localization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
"module_summarization": """# Module summarization\n
Our Module summarization benchmark 🤗 [JetBrains-Research/lca-module-summarization](https://huggingface.co/datasets/JetBrains-Research/lca-module-summarization) includes 216 manually curated text files describing the documentation of open-source permissively licensed Python projects.
The model is required to generate such a description, given the relevant context code and the intent behind the documentation.
We use a novel metric for evaluation:
* `CompScore`: a new metric proposed for this task that uses an LLM as an assessor. Our approach involves feeding the LLM the relevant code and two versions of the documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/README.md).
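Purely as a hypothetical illustration of the LLM-as-assessor idea (the `judge_prefers_generated` helper below is made up and not part of any library; the real `CompScore` implementation is described in the link above), a pairwise comparison might look like this:
```python
def comp_score(code: str, gold_doc: str, generated_doc: str, judge_prefers_generated) -> float:
    # Hypothetical sketch: ask an LLM judge which documentation fits the code better,
    # presenting the two candidates in both orders to reduce position bias.
    first = judge_prefers_generated(code, first=generated_doc, second=gold_doc)
    second = judge_prefers_generated(code, first=gold_doc, second=generated_doc)
    return 100.0 * (int(first) + int(second)) / 2
```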
For further details on the dataset and the baselines from the 🏟️ Long Code Arena team, refer to the `module_summarization` directory in [our baselines repository](https://github.com/JetBrains-Research/lca-baselines/blob/main/module_summarization/).
If you have any questions or requests concerning this dataset, please contact us at [email protected].
**Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
""",
}
def get_submission_text_files_for_task(task_pretty: Optional[str]) -> str:
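    """Return Markdown instructions on attaching prediction files for the given task.

    `task_pretty` is a human-readable task name (a value from `TASKS_PRETTY`);
    `None` or an empty string yields a generic prompt asking to select a task first.
    """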
    if not task_pretty:
        return "Please select a specific task to see more detailed instructions regarding submitting files."
    task_id = TASKS_PRETTY_REVERSE[task_pretty]
    if task_id == "commit_message_generation":
        return f"""**{task_pretty} Instructions:**\n\n* Please attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by the 🏟️ Long Code Arena Team in 🤗 [JetBrains-Research/lca-results](https://huggingface.co/datasets/JetBrains-Research/lca-results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example; the rest are optional."""
    return f"**{task_pretty} Instructions:**\n\n* 🚧 There are no instructions for the current task yet."