Commit eea0bde by HenryStephen (parent: dd1a4ad): Update README.md

README.md CHANGED
@@ -1,16 +1,5 @@
----
-title: RepoSnipy
-emoji: 🐍
-colorFrom: green
-colorTo: yellow
-sdk: streamlit
-sdk_version: 1.31.1
-app_file: app.py
-pinned: true
-license: mit
----
-
 # RepoSnipy 🐍
 Neural search engine for discovering semantically similar Python repositories on GitHub.
 
 ## Demo
@@ -29,71 +18,16 @@ Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalys
 * Multi-level embeddings --- code, docstring, readme, requirement, and repository.
 * It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and generate embeddings for them.
 * It transfers multiple topics into one cluster --- a [KMeans](data/kmeans_model_scibert.pkl) model analyses the topic embeddings and clusters repositories by topic.
-*
 
 We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9,700 GitHub Python repositories, each with a license and over 300 stars as of February 2024. The corresponding clusters were put in a [json dataset](data/repo_clusters.json) (stored as repo-cluster key-value pairs).
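The KMeans topic-clustering step described above can be sketched as follows. The repository names and embeddings here are toy stand-ins (the real pipeline lives in `data/generate_cluster.py` and uses SciBERT topic embeddings):

```python
import json

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
repos = ["org/repo-a", "org/repo-b", "org/repo-c", "org/repo-d"]
# Stand-in for SciBERT topic embeddings (768-dimensional in the real model)
topic_embeddings = rng.normal(size=(len(repos), 768))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(topic_embeddings)

# Map each repository to its cluster id, mirroring data/repo_clusters.json
repo_clusters = {repo: int(label) for repo, label in zip(repos, kmeans.labels_)}
print(json.dumps(repo_clusters, indent=2))
```

In the real pipeline the fitted model is pickled (as `kmeans_model_scibert.pkl`) so new repositories can later be assigned to the existing clusters.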
-## Installation
-
-### Prerequisites
-* Python 3.11
-* pip
-
-### Installation with code
-We recommend first creating a [conda](https://conda.io/projects/conda/en/latest/index.html) environment with `python 3.11`, then cloning the repository:
-```bash
-conda create --name py311 python=3.11
-conda activate py311
-git clone https://github.com/RepoMining/RepoSnipy
-```
-After cloning the repository, install the required packages. **Make sure the python and pip you use are both from the conda environment!**
-```bash
-cd RepoSnipy
-pip install -r requirements.txt
-```
-
-### Usage
-Then run the app on your local machine using:
-```bash
-streamlit run app.py
-```
-or
-```bash
-python -m streamlit run app.py
-```
-Importantly, to avoid version or package-location conflicts, make sure the **streamlit you use is from the conda environment**!
-
-### Dataset
-As mentioned above, RepoSnipy needs the [vector dataset](data/index.bin), the [json dataset](data/repo_clusters.json) and the [KMeans](data/kmeans_model_scibert.pkl) model at startup. For your convenience, we have uploaded them to the [data](data) folder of this repository.
-
-For research purposes, we also provide the following scripts to recreate them:
-```bash
-cd data
-python create_index.py       # creates the vector dataset (binary files)
-python generate_cluster.py   # creates the cluster model and information (KMeans model and json files representing repo-clusters)
-```
-
-See the two scripts above for details. Running them produces the following files:
-1. Generated by [create_index.py](data/create_index.py):
-```bash
-repositories.txt                      # the original repositories file
-invalid_repositories.txt              # the invalid repositories file
-filtered_repositories.txt             # the final repositories file, with duplicated and invalid repositories removed
-index{i}_{i * target_sub_length}.bin  # sub-index files, where i is the number of sub-repositories and target_sub_length is the sub-repository length
-index.bin                             # the index file merged from the sub-index files, with numpy zero arrays removed
-```
-2. Generated by [generate_cluster.py](data/generate_cluster.py):
-```
-repo_clusters.json        # a json file representing the repo-cluster dictionary
-kmeans_model_scibert.pkl  # a pickle file storing the KMeans model based on topic embeddings generated by the SciBERT model
-```
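Consuming the two `generate_cluster.py` artifacts at startup might look like the sketch below. The file contents and the temporary directory are toy stand-ins for illustration only; the real files live in `data/`:

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)

    # Stand-ins for the real artifacts shipped in data/
    data.joinpath("repo_clusters.json").write_text(json.dumps({"org/repo": 1}))
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(np.eye(4))
    data.joinpath("kmeans_model_scibert.pkl").write_bytes(pickle.dumps(model))

    # What startup would do: load the repo->cluster map and the fitted model
    repo_clusters = json.loads(data.joinpath("repo_clusters.json").read_text())
    kmeans = pickle.loads(data.joinpath("kmeans_model_scibert.pkl").read_bytes())

    # A new topic embedding can then be assigned to an existing cluster
    new_cluster = int(kmeans.predict(np.eye(4)[:1])[0])
```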
-
-##
-
 
 The [evaluation script](evaluate.py) finds all combinations of repository pairs in the dataset and calculates the cosine similarity between their embeddings. It also checks whether they share at least one topic (other than `python` and `python3`). We then compare the two and use the ROC AUC score to evaluate embedding performance. The resulting dataframe, containing the cosine similarity and topic similarity of all pairs, can be downloaded from [here](https://huggingface.co/datasets/Lazyhope/RepoSnipy_eval/tree/main) and covers both the code embedding and docstring embedding evaluations. The resulting ROC AUC score is around 0.84 for code embeddings and around 0.81 for docstring embeddings.
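The evaluation just described can be sketched as follows. The embeddings and topic sets are toy stand-ins; `evaluate.py` is the authoritative version:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy stand-ins: per-repository embeddings and topic sets
embeddings = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.9, 0.1]),
    "c": np.array([0.0, 1.0]),
}
topics = {"a": {"nlp"}, "b": {"nlp", "search"}, "c": {"vision"}}
IGNORED = {"python", "python3"}  # topics excluded from the overlap check

scores, labels = [], []
for r1, r2 in combinations(embeddings, 2):
    e1, e2 = embeddings[r1], embeddings[r2]
    cos = float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    shared = (topics[r1] - IGNORED) & (topics[r2] - IGNORED)
    scores.append(cos)       # prediction: cosine similarity of embeddings
    labels.append(1 if shared else 0)  # ground truth: shared topic or not

auc = roc_auc_score(labels, scores)
```

On this toy data the one topic-sharing pair also has the highest cosine similarity, so the AUC is 1.0; on the real dataset the scores reported above are around 0.84 and 0.81.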
 
 ## License
 # RepoSnipy 🐍
+[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/Henry65/RepoSnipy)
 Neural search engine for discovering semantically similar Python repositories on GitHub.
 
 ## Demo
 
 * Multi-level embeddings --- code, docstring, readme, requirement, and repository.
 * It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and generate embeddings for them.
 * It transfers multiple topics into one cluster --- a [KMeans](data/kmeans_model_scibert.pkl) model analyses the topic embeddings and clusters repositories by topic.
+* It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates cluster similarity from the multi-level embeddings and cluster labels.
+More concretely, the SimilarityCal model treats a pair of repositories in the same cluster as label 1 and any other pair as label 0. Its input features are the concatenation of the two repositories' embeddings, with the binary labels as above.
+The SimilarityCal model outputs scores for how similar or dissimilar two repositories are.
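The pairwise scheme just described can be illustrated with a small stand-in classifier. The real SimilarityCal model is the PyTorch checkpoint `SimilarityCal_model_NO1.pt`; the logistic regression below merely mirrors its input and label layout:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_repos, dim = 20, 8

# Toy repository embeddings and their cluster ids (stand-ins)
embeddings = rng.normal(size=(n_repos, dim))
clusters = np.arange(n_repos) % 3

# Each training pair: two embeddings concatenated; label 1 iff same cluster
X, y = [], []
for i in range(n_repos):
    for j in range(i + 1, n_repos):
        X.append(np.concatenate([embeddings[i], embeddings[j]]))
        y.append(1 if clusters[i] == clusters[j] else 0)

clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))

# Similarity score for a pair: predicted probability of "same cluster"
pair = np.concatenate([embeddings[0], embeddings[1]])[None, :]
score = clf.predict_proba(pair)[0, 1]
```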
 
 We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9,700 GitHub Python repositories, each with a license and over 300 stars as of February 2024. The corresponding clusters were put in a [json dataset](data/repo_clusters.json) (stored as repo-cluster key-value pairs).
 
+## Dataset
+As mentioned above, RepoSnipy needs the [vector dataset](data/index.bin), the [json dataset](data/repo_clusters.json), the [KMeans](data/kmeans_model_scibert.pkl) model and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model at startup. For your convenience, we have uploaded them to the [data](data) folder of this repository.
 
 ## License