RepoSnipy 🐉

Neural search engine for discovering semantically similar Python repositories on GitHub.

Demo

TODO --- Update the gif file!!!

Searching an indexed repository:

About

RepoSnipy is a neural search engine built with streamlit and docarray. You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.
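As a rough illustration of what "semantically similar" can mean for vector search, the sketch below ranks repositories by cosine similarity between embedding vectors. The choice of cosine similarity, the function names `cosine_similarity` and `top_k_similar`, and the toy index are assumptions made for this example; RepoSnipy's actual retrieval goes through its docarray index and may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k_similar(query, index, k=2):
    """Return the k (name, score) pairs in `index` closest to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical toy index: repository name -> embedding (made-up values).
toy_index = {
    "owner/repo-a": [1.0, 0.0, 0.0],
    "owner/repo-b": [0.9, 0.1, 0.0],
    "owner/repo-c": [0.0, 1.0, 0.0],
}

results = top_k_similar([1.0, 0.0, 0.0], toy_index, k=2)
for name, score in results:
    print(f"{name}: {score:.4f}")
```

A real index would hold high-dimensional embeddings and use an approximate or batched search rather than this linear scan.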

Compared to the previous generation of RepoSnipy, the latest version adds the following features:

It uses RepoSim4Py, which is based on the RepoSim4Py pipeline, to create multi-level embeddings for Python repositories.
Multi-level embeddings cover code, docstring, readme, requirement, and repository.
It uses the SciBERT model to analyse repository topics and to generate embeddings for topics.
It maps multiple topics to a single cluster: a KMeans model analyses the topic embeddings and clusters repositories by topic.
It uses the SimilarityCal model, a binary classifier that calculates cluster similarity from the multi-level embeddings and the cluster assignments. More specifically, SimilarityCal treats a pair of repositories in the same cluster as label 1 and any other pair as label 0. Its input features are the concatenation of the two repositories' embeddings, and its targets are the binary labels described above. Its output is a score indicating how similar or dissimilar the two repositories are.
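The way SimilarityCal's training data is described (concatenated embeddings as features, same-cluster as label 1) can be sketched as follows. This is only an illustration of that pairing scheme under stated assumptions: the function name `build_pairs`, the two-dimensional embeddings, and the repository names are hypothetical, and the real model's architecture and training code are not shown here.

```python
from itertools import combinations

def build_pairs(embeddings, clusters):
    """Build (features, labels) for every unordered pair of repositories.

    features: the concatenation of the two repositories' embeddings
    labels:   1 if both repositories share a cluster, otherwise 0
    """
    features, labels = [], []
    for repo_a, repo_b in combinations(sorted(embeddings), 2):
        features.append(embeddings[repo_a] + embeddings[repo_b])  # list concatenation
        labels.append(1 if clusters[repo_a] == clusters[repo_b] else 0)
    return features, labels

# Hypothetical toy data: tiny embeddings and cluster ids (made-up values).
toy_embeddings = {
    "owner/repo-a": [0.1, 0.2],
    "owner/repo-b": [0.3, 0.4],
    "owner/repo-c": [0.5, 0.6],
}
toy_clusters = {"owner/repo-a": 0, "owner/repo-b": 0, "owner/repo-c": 1}

pair_features, pair_labels = build_pairs(toy_embeddings, toy_clusters)
for feats, label in zip(pair_features, pair_labels):
    print(label, feats)
```

Any binary classifier that outputs a probability could then be fitted on `pair_features` and `pair_labels`, with the predicted probability serving as the similarity score.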

We have created a vector dataset (stored as a docarray index) of approximately 9700 GitHub Python repositories that had a license and over 300 stars as of February 2024. The corresponding clusters are stored in a JSON dataset (repo-cluster pairs as key-values).
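For readers who want to work with the repo-cluster mapping, the sketch below shows one way to invert such a key-value JSON object into a cluster-to-repositories lookup. The JSON content, the repository names, and the helper `group_by_cluster` are illustrative assumptions; the exact schema of the shipped file is not documented here.

```python
import json
from collections import defaultdict

# Hypothetical JSON content: repository name -> cluster id (made-up values).
raw = '{"owner/repo-a": 0, "owner/repo-b": 0, "owner/repo-c": 1}'
repo_to_cluster = json.loads(raw)

def group_by_cluster(mapping):
    """Invert a repo -> cluster mapping into cluster -> list of repos."""
    groups = defaultdict(list)
    for repo, cluster in mapping.items():
        groups[cluster].append(repo)
    return dict(groups)

cluster_to_repos = group_by_cluster(repo_to_cluster)
print(cluster_to_repos)
```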

Dataset

As mentioned above, RepoSnipy needs the vector dataset, the JSON dataset, the KMeans model, and the SimilarityCal model at startup. For your convenience, we have uploaded them to the data folder of this repository.

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

The model and the fine-tuning dataset used: