	# RepoSnipy 🐉
	[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/Henry65/RepoSnipy)
	Neural search engine for discovering semantically similar Python repositories on GitHub.

	## Demo
	TODO: update the GIF file.

	Searching an indexed repository:

	![Search Indexed Repo Demo](assets/search.gif)

	## About

	RepoSnipy is a neural search engine built with [streamlit](https://github.com/streamlit/streamlit) and [docarray](https://github.com/docarray/docarray). You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.

	Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalysis/RepoSnipy), the latest version adds the following features:
	* It uses [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py), which is based on the [RepoSim4Py pipeline](https://huggingface.co/Henry65/RepoSim4Py), to create multi-level embeddings for Python repositories.
	* The multi-level embeddings cover five levels: code, docstring, readme, requirement, and repository.
	* It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate embeddings for topics.
	* It maps the multiple topics of a repository to a single cluster: a [KMeans](data/kmeans_model_scibert.pkl) model analyses the topic embeddings and clusters repositories by topic.
	* It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates cluster similarity from the multi-level embeddings and the clusters.
	More specifically, the SimilarityCal model treats a pair of repositories in the same cluster as label 1 and any other pair as label 0. Its input features are the concatenation of the two repositories' embeddings, and its targets are these binary labels.
	The output of the SimilarityCal model is a score indicating how similar or dissimilar two repositories are.
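
The way input features and labels are formed for a pair of repositories can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's training code: the function name `build_pairs`, the repository names, and the toy 2-dimensional embeddings are all hypothetical, and the real multi-level embeddings are far larger.

```python
from itertools import combinations


def build_pairs(embeddings, clusters):
    """Build (repo_a, repo_b, features, label) tuples for a binary classifier.

    embeddings: dict mapping repo name -> list of floats (toy vectors here)
    clusters:   dict mapping repo name -> cluster id
    """
    pairs = []
    for repo_a, repo_b in combinations(sorted(embeddings), 2):
        # Input features: the two repositories' embeddings, concatenated.
        features = embeddings[repo_a] + embeddings[repo_b]
        # Label 1 if both repositories fall in the same cluster, otherwise 0.
        label = 1 if clusters[repo_a] == clusters[repo_b] else 0
        pairs.append((repo_a, repo_b, features, label))
    return pairs


# Toy data; names and values are made up for illustration.
embeddings = {
    "org/web-a": [0.9, 0.1],
    "org/web-b": [0.8, 0.2],
    "org/ml-c": [0.1, 0.9],
}
clusters = {"org/web-a": 0, "org/web-b": 0, "org/ml-c": 1}

pairs = build_pairs(embeddings, clusters)
for repo_a, repo_b, features, label in pairs:
    print(repo_a, repo_b, features, label)
```

A classifier trained on such pairs would then emit a score per pair. How SimilarityCal itself is built and trained is not shown here.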

	We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9700 GitHub Python repositories that had a license and over 300 stars as of February 2024. The corresponding clusters are stored in a [JSON dataset](data/repo_clusters.json) that maps each repository to its cluster as key-value pairs.
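
As a rough sketch of working with a repository-to-cluster mapping of this kind, the snippet below parses a small JSON string and inverts it into cluster-to-repositories groups. The sample entries and the helper `group_by_cluster` are hypothetical, and the exact key format of `data/repo_clusters.json` is an assumption.

```python
import json
from collections import defaultdict

# Hypothetical excerpt standing in for data/repo_clusters.json;
# the real file's key format may differ.
sample = '{"org/web-a": 0, "org/web-b": 0, "org/ml-c": 1}'


def group_by_cluster(repo_clusters):
    """Invert a repo -> cluster mapping into cluster -> sorted list of repos."""
    groups = defaultdict(list)
    for repo, cluster in repo_clusters.items():
        groups[cluster].append(repo)
    return {cluster: sorted(repos) for cluster, repos in groups.items()}


repo_clusters = json.loads(sample)
groups = group_by_cluster(repo_clusters)
print(groups)
```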


	## Dataset
	As mentioned above, RepoSnipy needs the [vector](data/index.bin) and [json](data/repo_clusters.json) datasets, the [KMeans](data/kmeans_model_scibert.pkl) model, and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model at startup. For your convenience, we have uploaded them to the [data](data) folder of this repository.
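
Before launching the app, it may help to confirm that all four files are in place. The check below is a small sketch and not part of RepoSnipy: the helper `missing_artifacts` is hypothetical, and only the file names are taken from the links above.

```python
from pathlib import Path

# File names as linked from this README, relative to the data folder.
REQUIRED_FILES = (
    "index.bin",
    "repo_clusters.json",
    "kmeans_model_scibert.pkl",
    "SimilarityCal_model_NO1.pt",
)


def missing_artifacts(data_dir="data"):
    """Return the names of required files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```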


	## License

	Distributed under the MIT License. See [LICENSE](LICENSE) for more information.

	## Acknowledgments

	The models and the fine-tuning dataset used:

	* [UniXCoder](https://arxiv.org/abs/2203.03850)
	* [AdvTest](https://arxiv.org/abs/1909.09436)
	* [SciBERT](https://arxiv.org/abs/1903.10676)