---
title: RepoSnipy
emoji: 🐉
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: true
license: mit
---

# RepoSnipy 🐉
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/Henry65/RepoSnipy)
Neural search engine for discovering semantically similar Python repositories on GitHub.

## Demo
**TODO --- Update the gif file!!!**

Searching an indexed repository:

![Search Indexed Repo Demo](assets/search.gif)

## About

RepoSnipy is a neural search engine built with [streamlit](https://github.com/streamlit/streamlit) and [docarray](https://github.com/docarray/docarray). You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.

Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalysis/RepoSnipy), the latest version adds the following features:
* It uses [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py), which is based on the [RepoSim4Py pipeline](https://huggingface.co/Henry65/RepoSim4Py), to create multi-level embeddings for Python repositories.
* The multi-level embeddings cover code, docstrings, readme, requirements, and the repository as a whole.
* It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate topic embeddings.
* It maps a repository's multiple topics to a single cluster: a [KMeans](data/kmeans_model_scibert.pkl) model analyses the topic embeddings and clusters repositories by topic.
* It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates cluster similarity from the multi-level embeddings and the clusters.

In more detail, the SimilarityCal model treats two repositories in the same cluster as label 1 and any other pair as label 0. Its input features are the concatenation of the two repositories' embeddings, and its binary labels are the ones described above.
The output of the SimilarityCal model is a score indicating how similar or dissimilar two repositories are.
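
The pair construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's training code: the helper `build_pairs`, the repository names, the two-dimensional embeddings, and the cluster ids are all hypothetical.

```python
from itertools import combinations


def build_pairs(embeddings, clusters):
    """Build (pair, features, label) tuples for every unordered repository pair.

    features: the two embeddings concatenated into one list.
    label: 1 if both repositories share a cluster, otherwise 0.
    """
    pairs = []
    for a, b in combinations(sorted(embeddings), 2):
        features = embeddings[a] + embeddings[b]  # list concatenation
        label = 1 if clusters[a] == clusters[b] else 0
        pairs.append(((a, b), features, label))
    return pairs


# Hypothetical toy data; real embeddings are far larger.
embeddings = {
    "org/repo-a": [0.1, 0.2],
    "org/repo-b": [0.3, 0.4],
    "org/repo-c": [0.5, 0.6],
}
clusters = {"org/repo-a": 0, "org/repo-b": 0, "org/repo-c": 1}

pairs = build_pairs(embeddings, clusters)
for pair, features, label in pairs:
    print(pair, features, label)
```

A binary classifier trained on such features and labels would then output a score for any new pair of repositories.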

We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9,700 GitHub Python repositories that have a license and more than 300 stars as of February 2024. The corresponding clusters are stored in a [JSON dataset](data/repo_clusters.json) as repository-to-cluster key-value pairs.
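
As a rough sketch of how a repository-to-cluster mapping like this could be used, the snippet below parses a small JSON string and groups repositories by cluster. The sample contents, the integer cluster ids, and the helper `same_cluster` are assumptions for illustration; the actual key and value formats in `data/repo_clusters.json` may differ.

```python
import json
from collections import defaultdict

# Hypothetical sample in the assumed {"repository": cluster_id} shape.
sample = '{"org/repo-a": 3, "org/repo-b": 3, "org/repo-c": 7}'
repo_clusters = json.loads(sample)


def same_cluster(mapping, repo_a, repo_b):
    """Return True if both repositories are known and share a cluster."""
    return (
        repo_a in mapping
        and repo_b in mapping
        and mapping[repo_a] == mapping[repo_b]
    )


# Invert the mapping: cluster id -> list of repositories in that cluster.
members = defaultdict(list)
for repo, cluster in repo_clusters.items():
    members[cluster].append(repo)

print(same_cluster(repo_clusters, "org/repo-a", "org/repo-b"))
print(dict(members))
```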


## Dataset
As mentioned above, RepoSnipy needs the [vector](data/index.bin) and [json](data/repo_clusters.json) datasets, the [KMeans](data/kmeans_model_scibert.pkl) model, and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model at startup. For your convenience, we have uploaded them to the [data](data) folder of this repository.
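
Before starting the app, it can help to confirm that these four files are in place. The check below is a minimal sketch: the file names come from the links above, while the helper `missing_data_files` is hypothetical and not part of RepoSnipy.

```python
from pathlib import Path

# File names taken from the links in this README.
REQUIRED_FILES = (
    "index.bin",
    "repo_clusters.json",
    "kmeans_model_scibert.pkl",
    "SimilarityCal_model_NO1.pt",
)


def missing_data_files(data_dir="data"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All required data files are present.")
```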


## License

Distributed under the MIT License. See [LICENSE](LICENSE) for more information.

## Acknowledgments

The models and the fine-tuning dataset used:

* [UniXCoder](https://arxiv.org/abs/2203.03850)
* [AdvTest](https://arxiv.org/abs/1909.09436)
* [SciBERT](https://arxiv.org/abs/1903.10676)