---
title: README
emoji: 📉
colorFrom: yellow
colorTo: indigo
sdk: static
pinned: false
---

# RepoFusion: Training Code Models to Understand Your Repository

Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak

This space contains the released resources for our paper "RepoFusion: Training Code Models to Understand Your Repository". A block diagram of our approach is shown below; for more details, refer to the paper.

![block diagram](block_diagram.png)

## Data

Stack-Repo can be downloaded from the [Datasets](https://huggingface.co/datasets/RepoFusion/Stack-Repo) section of this space. It contains three folders corresponding to our train, validation, and test splits. Each split contains a separate folder per repository, and each repository folder contains all the .java files of that repository in their original directory structure, along with three .json files holding the Prompt Proposal, BM25, and RandomNN repo contexts. Please see the README in the Datasets section for the organization of the dataset and details on accessing it.

## Trained Checkpoints

The trained checkpoints can be downloaded from the [Models](https://huggingface.co/RepoFusion/trained_checkpoints) section of this space. We have released the following checkpoints:

- `RepoFusion_PP_contexts`: RepoFusion model trained with Prompt Proposal repo contexts. This is our best-performing model.
- `RepoFusion_BM25_contexts`: RepoFusion model trained with BM25 repo contexts.
- `RepoFusion_RandomNN_contexts`: RepoFusion model trained with RandomNN repo contexts.
- `finetuned_codet5base`: Our finetuned CodeT5-base model, used as the initialization for our RepoFusion models.
- `finetuned_codet5large`: Our finetuned CodeT5-large model, used as a baseline.

## Code

We will shortly release the code for training and evaluating RepoFusion and finetuning CodeT5, along with instructions for running the scripts. Watch this space for updates.
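
## Example: Indexing a Local Copy of Stack-Repo

As a convenience, the per-repository layout described in the Data section (one folder per repository, .java files in their original directory structure, plus three repo-context .json files) can be traversed with a short script. The sketch below is illustrative only: the exact filenames of the three context files (`PP_contexts.json`, `BM25_contexts.json`, `RandomNN_contexts.json`) are assumptions, so check the dataset README for the actual names before relying on them.

```python
import json
from pathlib import Path

# Hypothetical names for the three repo-context files; the actual
# filenames in Stack-Repo may differ -- see the dataset README.
CONTEXT_FILES = ("PP_contexts.json", "BM25_contexts.json", "RandomNN_contexts.json")


def index_split(split_dir):
    """Walk one split (train/validation/test) of a local Stack-Repo
    snapshot and map each repository to its .java files and any
    repo-context JSON files found at the repository root."""
    index = {}
    for repo_dir in sorted(Path(split_dir).iterdir()):
        if not repo_dir.is_dir():
            continue
        index[repo_dir.name] = {
            # All .java files, searched recursively to preserve the
            # original nested directory structure.
            "java_files": sorted(str(p) for p in repo_dir.rglob("*.java")),
            # Only the context files that are actually present.
            "contexts": {
                name: str(repo_dir / name)
                for name in CONTEXT_FILES
                if (repo_dir / name).exists()
            },
        }
    return index
```

After downloading a split locally (for example with `git clone` or `huggingface_hub`), calling `index_split("train")` returns a dictionary keyed by repository name, which can then be used to pair each .java file with its retrieved repo contexts.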