---
title: RepoSnipy
emoji: πŸ‰
colorFrom: green
colorTo: yellow
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: true
license: mit
---

# RepoSnipy πŸ‰
Neural search engine for discovering semantically similar Python repositories on GitHub.

## Demo
**TODO --- Update the gif file!!!**

Searching an indexed repository:

![Search Indexed Repo Demo](assets/search.gif)

## About

RepoSnipy is a neural search engine built with [streamlit](https://github.com/streamlit/streamlit) and [docarray](https://github.com/docarray/docarray). You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.

Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalysis/RepoSnipy), the latest version adds the following features:
* It uses [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py), based on the [RepoSim4Py pipeline](https://huggingface.co/Henry65/RepoSim4Py), to create multi-level embeddings for Python repositories.
* Multi-level embeddings --- code, docstring, readme, requirement, and repository levels.
* It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate topic embeddings.
* It groups multiple topics into one cluster --- a [KMeans](data/kmeans_model_scibert.pkl) model analyses the topic embeddings and clusters repositories by topic (see the sketch after this list).
* **SimilarityCal --- TODO update!!!**
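
To make the multi-level idea concrete, here is a minimal sketch of one repository record and of assigning a repository to a topic cluster with the shipped KMeans model. The field names, the 768-dimension assumption (SciBERT's hidden size), and the random placeholder vectors are illustrative, not the pipeline's actual schema:
```python
import pickle

import numpy as np

# Hypothetical multi-level embedding record for one repository; the real
# schema is produced by the RepoSim4Py pipeline and may differ.
repo_record = {
    "code": np.random.rand(768),
    "docstring": np.random.rand(768),
    "readme": np.random.rand(768),
    "requirement": np.random.rand(768),
    "repository": np.random.rand(768),  # aggregate repository-level vector
}

# Assign the repository to a topic cluster with the shipped KMeans model,
# given a SciBERT embedding of its topics (a random placeholder here;
# 768 is assumed to match the dimension the model was fitted on).
with open("data/kmeans_model_scibert.pkl", "rb") as f:
    kmeans = pickle.load(f)

topic_embedding = np.random.rand(1, 768)
cluster_id = int(kmeans.predict(topic_embedding)[0])
print(f"Assigned to cluster {cluster_id}")
```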

We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9,700 GitHub Python repositories, each with a license and over 300 stars as of February 2024. The corresponding clusters are stored in a [JSON dataset](data/repo_clusters.json) (a repository-to-cluster mapping).

## Installation

### Prerequisites
* Python 3.11
* pip

### Installation with code
We recommend first creating a [conda](https://conda.io/projects/conda/en/latest/index.html) environment with Python 3.11, then cloning the repository:
```bash
conda create --name py311 python=3.11
conda activate py311
git clone https://github.com/RepoMining/RepoSnipy
```
After cloning the repository, install the required packages. **Make sure the `python` and `pip` you use both come from the conda environment!**
```bash
cd RepoSnipy
pip install -r requirements.txt
```

### Usage
Then run the app on your local machine using:
```bash
streamlit run app.py
```
or
```bash
python -m streamlit run app.py
```
Importantly, to avoid unnecessary conflicts (such as version or package-location conflicts), make sure the **`streamlit` you run comes from the conda environment**!

### Dataset
As mentioned above, RepoSnipy needs the [vector dataset](data/index.bin), the [JSON dataset](data/repo_clusters.json), and the [KMeans](data/kmeans_model_scibert.pkl) model at startup. For convenience, all three are included in this repository's [data](data) folder.
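
For reference, here is a minimal sketch of loading two of the three artifacts (the binary index is loaded with docarray, whose exact loading call depends on the docarray version pinned in requirements.txt, so it is left out here):
```python
import json
import pickle

# Repository -> cluster mapping produced by generate_cluster.py
with open("data/repo_clusters.json") as f:
    repo_clusters = json.load(f)

# Pickled KMeans model fitted on SciBERT topic embeddings
with open("data/kmeans_model_scibert.pkl", "rb") as f:
    kmeans = pickle.load(f)

print(f"{len(repo_clusters)} repositories across {kmeans.n_clusters} clusters")
```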

For research purposes, we also provide the scripts to recreate them:
```bash
cd data
python create_index.py  # creates the vector dataset (binary index files)
python generate_cluster.py  # creates the KMeans model and the repo-clusters JSON file
```

See these two scripts for more details. Running them produces the following files:
1. Generated by [create_index.py](data/create_index.py):
```bash
repositories.txt  # the original list of repositories
invalid_repositories.txt  # repositories that failed validation
filtered_repositories.txt  # the final list, with duplicates and invalid repositories removed
index{i}_{i * target_sub_length}.bin  # sub-index files, where i is the sub-batch number and target_sub_length is the sub-batch size
index.bin  # the final index, merged from the sub-index files with all-zero numpy embeddings removed
```
2. Generated by [generate_cluster.py](data/generate_cluster.py):
```bash
repo_clusters.json  # the repository-to-cluster mapping
kmeans_model_scibert.pkl  # the pickled KMeans model fitted on SciBERT topic embeddings
```


## Evaluation
**TODO ---- update!!!**

The [evaluation script](evaluate.py) enumerates all pairs of repositories in the dataset and calculates the cosine similarity between their embeddings. It also checks whether each pair shares at least one topic (ignoring `python` and `python3`). Using shared topics as the ground-truth label and cosine similarity as the score, we compute the ROC AUC to evaluate embedding quality. The resulting dataframe with all pairwise cosine and topic similarities can be downloaded from [here](https://huggingface.co/datasets/Lazyhope/RepoSnipy_eval/tree/main), covering both the code-embedding and docstring-embedding evaluations. The ROC AUC is around 0.84 for code embeddings and around 0.81 for docstring embeddings.
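
A minimal sketch of that procedure (the container names `embeddings` and `topics` are illustrative; see [evaluate.py](evaluate.py) for the actual implementation):
```python
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

IGNORED_TOPICS = {"python", "python3"}

def evaluate(embeddings: dict[str, np.ndarray], topics: dict[str, set[str]]) -> float:
    """Cosine similarity as the score, shared topics as the label, ROC AUC as the metric."""
    scores, labels = [], []
    for a, b in combinations(embeddings, 2):
        va, vb = embeddings[a], embeddings[b]
        scores.append(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))
        labels.append(int(bool((topics[a] - IGNORED_TOPICS) & (topics[b] - IGNORED_TOPICS))))
    return roc_auc_score(labels, scores)
```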

## License

Distributed under the MIT License. See [LICENSE](LICENSE) for more information.

## Acknowledgments

The models and fine-tuning datasets used:

* [UniXCoder](https://arxiv.org/abs/2203.03850)
* [AdvTest](https://arxiv.org/abs/1909.09436)
* [SciBERT](https://arxiv.org/abs/1903.10676)