metadata

title: README
emoji: 📉
colorFrom: yellow
colorTo: indigo
sdk: static
pinned: false

RepoFusion: Training Code Models to Understand Your Repository

Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak

This space contains the released resources for our paper "RepoFusion: Training Code Models to Understand Your Repository". A block diagram of our approach can be found below. For more details, refer to the paper.

Data

Stack-Repo can be downloaded from the Datasets section of this space. It contains three folders corresponding to our train, validation, and test splits. Each split contains a separate folder for each repository. Each repository folder holds all of the repository's .java files in their original directory structure, along with three .json files corresponding to the Prompt Proposal, BM25, and RandomNN repo contexts. Please see the README in the Datasets section for details on how the dataset is organized and how to access it.
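As a quick sanity check after downloading, the layout described above can be walked with a few lines of Python. This is a minimal sketch, not part of the released code. It assumes the dataset has been unpacked into a local directory whose immediate subfolders are the splits and whose second-level subfolders are the repositories. The function name `summarize_stack_repo` is hypothetical, and the exact names and locations of the .json files are not assumed: the sketch simply collects every .json file found under each repository folder.

```python
from pathlib import Path


def summarize_stack_repo(root):
    """Map each split to its repositories and the .java/.json files they hold.

    Assumes a local layout of root/<split>/<repository>/..., which is an
    illustration of the description above, not a documented guarantee.
    """
    summary = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        repos = {}
        for repo_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            repos[repo_dir.name] = {
                # Source files, kept relative to the repository root.
                "java": sorted(
                    p.relative_to(repo_dir).as_posix()
                    for p in repo_dir.rglob("*.java")
                ),
                # Repo-context files (Prompt Proposal, BM25, RandomNN).
                "json": sorted(
                    p.relative_to(repo_dir).as_posix()
                    for p in repo_dir.rglob("*.json")
                ),
            }
        summary[split_dir.name] = repos
    return summary
```

Calling `summarize_stack_repo` on the local download directory returns a nested dictionary keyed by split and repository name. Each repository is expected to list three .json files, one per repo context type.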

Trained Checkpoints

The trained checkpoints can be downloaded from the Models section of this space. We have released the following checkpoints:

RepoFusion_PP_contexts: RepoFusion model trained with prompt proposal repo contexts. This is our best performing model.
RepoFusion_BM25_contexts: RepoFusion model trained with BM25 repo contexts.
RepoFusion_RandomNN_contexts: RepoFusion model trained with RandomNN repo contexts.
finetuned_codet5base: Our finetuned CodeT5-base model. This was used as initialization for our RepoFusion models.
finetuned_codet5large: Our finetuned CodeT5-large model. This was used as a baseline.

Code

We will shortly release the code for training and evaluating RepoFusion and for finetuning CodeT5, along with instructions for running the scripts. Watch this space for updates.