
RepoSnipy

Neural search engine for discovering semantically similar Python repositories on GitHub.

Demo

TODO --- Update the gif file!!!

Searching an indexed repository:

Search Indexed Repo Demo

About

RepoSnipy is a neural search engine built with streamlit and docarray. You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.

Compared to the previous generation of RepoSnipy, the latest version adds the following features:

  • It uses the RepoSim4Py pipeline to create multi-level embeddings for Python repositories.
  • Multi-level embeddings --- code, docstring, readme, requirement, and repository level.
  • It uses the SciBERT model to analyse repository topics and to generate topic embeddings.
  • Topic clustering --- it uses a KMeans model on the topic embeddings to group repositories that share related topics (a minimal sketch follows this list).
  • SimilarityCal --- TODO update!!!
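
For illustration, here is a minimal sketch of how this topic-clustering step could work, assuming the allenai/scibert_scivocab_uncased checkpoint and the pre-trained KMeans model shipped in the data folder; the function name and mean pooling are our assumptions, not RepoSnipy's exact code:

```python
# Minimal sketch (assumptions noted): embed GitHub topics with SciBERT and
# assign them to clusters with the pre-trained KMeans model.
import pickle

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed_topic(topic: str) -> torch.Tensor:
    """One vector per topic; mean pooling is a common choice, assumed here."""
    inputs = tokenizer(topic, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Load the pre-trained KMeans model (see the Dataset section below).
with open("data/kmeans_model_scibert.pkl", "rb") as f:
    kmeans = pickle.load(f)

topics = ["machine-learning", "deep-learning"]
embeddings = torch.stack([embed_topic(t) for t in topics]).numpy()
print(kmeans.predict(embeddings))  # one cluster id per topic
```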

We have created a vector dataset (stored as a docarray index) of approximately 9,700 GitHub Python repositories that have a license and more than 300 stars as of February 2024. The corresponding clusters are stored in a JSON dataset that maps each repository to its cluster.
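
As a rough illustration of how such a docarray index can be queried, the sketch below uses InMemoryExactNNIndex; the document schema (field name and embedding size) is an assumption, since the real schema defines one field per embedding level:

```python
# Minimal sketch (schema assumed): load the docarray index and run a
# nearest-neighbour search over repository embeddings.
import numpy as np
from docarray import BaseDoc
from docarray.index import InMemoryExactNNIndex
from docarray.typing import NdArray

class RepoDoc(BaseDoc):
    name: str = ""
    repository_embedding: NdArray[768]  # hypothetical field name and size

index = InMemoryExactNNIndex[RepoDoc](index_file_path="data/index.bin")

# Stand-in query vector; in the app this would come from the RepoSim4Py pipeline.
query = np.random.rand(768).astype(np.float32)
matches, scores = index.find(query, search_field="repository_embedding", limit=10)
for doc, score in zip(matches, scores):
    print(doc.name, float(score))
```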

Installation

Prerequisites

  • Python 3.11
  • pip

Installation from source

We recommend first creating a conda environment with Python 3.11 and then cloning the repository:

conda create --name py311 python=3.11
conda activate py311
git clone https://github.com/RepoMining/RepoSnipy

After cloning the repository, install the required packages. Make sure the python and pip you use both come from the conda environment:

cd RepoSnipy
pip install -r requirements.txt

Usage

Then run the app on your local machine using:

streamlit run app.py

or

python -m streamlit run app.py

Importantly, to avoid version or package-location conflicts, make sure the streamlit you run also comes from the conda environment.

Dataset

As mentioned above, RepoSnipy needs the vector dataset, the JSON dataset, and the KMeans model at startup. For convenience, we have uploaded them to the data folder of this repository.

For research purposes, we also provide the following scripts so you can recreate them:

cd data
python create_index.py  # For creating vector dataset (binary files)
python generate_cluster.py  # For creating useful cluster model and information (KMeans model and json files representing repo-clusters)

Refer to these two scripts for more details. Running them produces the following files (a loading sketch follows the list):

  1. Generated by create_index.py:
repositories.txt  # the original list of repositories
invalid_repositories.txt  # the repositories found to be invalid
filtered_repositories.txt  # the final list, with duplicate and invalid repositories removed
index{i}_{i * target_sub_length}.bin  # sub-index files, where i is the sub-index number and target_sub_length is the number of repositories per sub-index
index.bin  # the final index file, merged from the sub-index files with numpy zero arrays removed
  2. Generated by generate_cluster.py:
repo_clusters.json  # a JSON file mapping each repository to its cluster
kmeans_model_scibert.pkl  # a pickle file storing the KMeans model trained on SciBERT topic embeddings
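
Loading these artefacts is straightforward; the sketch below assumes the paths listed above and that repo_clusters.json maps repository names to cluster ids:

```python
# Minimal sketch: load the generated cluster artefacts and look one repo up.
import json
import pickle

with open("data/repo_clusters.json") as f:
    repo_clusters = json.load(f)  # assumed shape: {repo_name: cluster_id}

with open("data/kmeans_model_scibert.pkl", "rb") as f:
    kmeans = pickle.load(f)

print(len(repo_clusters), "repositories,", kmeans.n_clusters, "clusters")
print(repo_clusters.get("RepoMining/RepoSnipy"))  # cluster id, if indexed
```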

Evaluation

TODO ---- update!!!

The evaluation script enumerates all pairs of repositories in the dataset and calculates the cosine similarity between their embeddings. It also checks whether each pair shares at least one topic (excluding python and python3). We then treat shared topics as ground-truth labels and use the ROC AUC score to evaluate how well the embedding similarities predict them. The resulting dataframe, containing cosine similarity and topic similarity for every pair, can be downloaded from here; it covers both the code embedding and docstring embedding evaluations. The ROC AUC score is around 0.84 for code embeddings and around 0.81 for docstring embeddings.
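
To make the metric concrete, here is a minimal self-contained sketch of the evaluation idea on random stand-in data (the real script iterates the actual dataset and its topics):

```python
# Minimal sketch: score repository pairs by cosine similarity of embeddings
# and check how well that predicts "shares at least one topic" via ROC AUC.
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))                    # stand-in embeddings
topics = [set(rng.choice(20, size=3)) for _ in range(50)]  # stand-in topic sets

sims, labels = [], []
for i, j in combinations(range(len(embeddings)), 2):
    a, b = embeddings[i], embeddings[j]
    sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    labels.append(int(bool(topics[i] & topics[j])))  # share >= 1 topic?

print("ROC AUC:", roc_auc_score(labels, sims))
```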

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

The model and the fine-tuning dataset used: