Spaces: Henry65/RepoSnipy

HenryStephen committed on Mar 3, 2024

Commit 43515a8 (1 parent: c831d35)

Deploying RepoSnipy

Files changed (15)

.gitattributes +1 -0
.gitignore +163 -0
LICENSE +21 -0
README.md +96 -13
app.py +442 -0
assets/search.gif +3 -0
data/SimilarityCal_model_NO1.pt +3 -0
data/index.bin +3 -0
data/index_test.bin +3 -0
data/kmeans_model_scibert.pkl +3 -0
data/pair_classifier.py +37 -0
data/repo_clusters.json +0 -0
data/repo_clusters_test.json +0 -0
data/repo_doc.py +18 -0
requirements.txt +12 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+*.gif filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

@@ -0,0 +1,163 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+# Streamlit configs
+.streamlit/

LICENSE ADDED

@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2024 RepoSnipy
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,13 +1,96 @@
----
-title: RepoSnipy
-emoji: 📈
-colorFrom: yellow
-colorTo: red
-sdk: streamlit
-sdk_version: 1.31.1
-app_file: app.py
-pinned: false
-license: mit
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# RepoSnipy 🐉
+Neural search engine for discovering semantically similar Python repositories on GitHub.
+## Demo
+**TODO --- Update the gif file!!!**
+Searching an indexed repository:
+![Search Indexed Repo Demo](assets/search.gif)
+## About
+RepoSnipy is a neural search engine built with [streamlit](https://github.com/streamlit/streamlit) and [docarray](https://github.com/docarray/docarray). You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.
+Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalysis/RepoSnipy), the latest version adds the following features:
+* It uses [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py), which is based on the [RepoSim4Py pipeline](https://huggingface.co/Henry65/RepoSim4Py), to create multi-level embeddings for Python repositories.
+* Multi-level embeddings: code, docstring, readme, requirement, and repository.
+* It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate embeddings for topics.
+* It maps multiple topics to a single cluster: a [KMeans](data/kmeans_model_scibert.pkl) model analyses the topic embeddings and clusters repositories by topic.
+* **SimilarityCal --- TODO update!!!**
+We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9700 GitHub Python repositories that have a license and more than 300 stars as of February 2024. The corresponding clusters are stored in a [JSON dataset](data/repo_clusters.json) (repository-to-cluster key-value pairs).
+## Installation
+### Prerequisites
+* Python 3.11
+* pip
+### Installation with code
+We recommend first creating a [conda](https://conda.io/projects/conda/en/latest/index.html) environment with Python 3.11 and then downloading the repository:
+```bash
+conda create --name py311 python=3.11
+conda activate py311
+git clone https://github.com/RepoMining/RepoSnipy
+```
+After downloading the repository, install the required packages. **Make sure the `python` and `pip` you use both come from the conda environment!**
+Run the following:
+```bash
+cd RepoSnipy
+pip install -r requirements.txt
+```
+### Usage
+Then run the app on your local machine using:
+```bash
+streamlit run app.py
+```
+or
+```bash
+python -m streamlit run app.py
+```
+To avoid unnecessary conflicts (such as version conflicts or package location conflicts), make sure that **the streamlit you run comes from the conda environment**!
+### Dataset
+As mentioned above, RepoSnipy needs the [vector](data/index.bin) dataset, the [json](data/repo_clusters.json) dataset, and the [KMeans](data/kmeans_model_scibert.pkl) model at startup. For your convenience, we have uploaded them to the [data](data) folder of this repository.
+For research purposes, we also provide the following scripts so that you can recreate them:
+```bash
+cd data
+python create_index.py  # For creating vector dataset (binary files)
+python generate_cluster.py  # For creating useful cluster model and information (KMeans model and json files representing repo-clusters)
+```
+See the two scripts above for more details. Running them produces the following files:
+1. Generated by [create_index.py](data/create_index.py):
+```bash
+repositories.txt  # the original repositories file
+invalid_repositories.txt  # the invalid repositories file, including invalid repositories
+filtered_repositories.txt  # the final repositories file, removing duplicated and invalid repositories
+index{i}_{i * target_sub_length}.bin  # the sub-index files, where i means number of sub-repositories and target_sub_length means sub-repositories length
+index.bin  # the index file merged by sub-index files and removed numpy zero arrays
+```
+2. Generated by [generate_cluster.py](data/generate_cluster.py):
+```
+repo_clusters.json  # a json file representing repo-cluster dictionary
+kmeans_model_scibert.pkl  # a pickle file for storing kmeans model based on topic embeddings generated by scibert model
+```
+## Evaluation
+**TODO ---- update!!!**
+The [evaluation script](evaluate.py) enumerates all pairs of repositories in the dataset and calculates the cosine similarity between their embeddings. It also checks whether each pair shares at least one topic (excluding `python` and `python3`). We then compare the two and use the ROC AUC score to evaluate the embeddings' performance. The resulting dataframe, containing the cosine similarity and topic similarity of every pair, can be downloaded from [here](https://huggingface.co/datasets/Lazyhope/RepoSnipy_eval/tree/main); it covers both the code embedding and the docstring embedding evaluations. The ROC AUC score is around 0.84 for the code embeddings and around 0.81 for the docstring embeddings.
+## License
+Distributed under the MIT License. See [LICENSE](LICENSE) for more information.
+## Acknowledgments
+The model and the fine-tuning dataset used:
+* [UniXCoder](https://arxiv.org/abs/2203.03850)
+* [AdvTest](https://arxiv.org/abs/1909.09436)
+* [SciBERT](https://arxiv.org/abs/1903.10676)
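The evaluation procedure described in the README (cosine similarity for every repository pair, a shared-topic label that ignores `python` and `python3`, then ROC AUC) can be illustrated with a minimal standard-library sketch. This is not the project's `evaluate.py`, which is not part of this commit; the toy repositories, the two-dimensional embeddings, and all helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt

IGNORED_TOPICS = {"python", "python3"}  # excluded from topic matching, per the README


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def share_topic(topics_a, topics_b):
    """True if the two topic lists share at least one non-ignored topic."""
    return bool((set(topics_a) & set(topics_b)) - IGNORED_TOPICS)


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: repository name -> (topics, embedding)
repos = {
    "a/web": (["python", "web"], [1.0, 0.1]),
    "b/web": (["web", "http"], [0.9, 0.2]),
    "c/ml": (["python", "ml"], [0.1, 1.0]),
    "d/ml": (["ml"], [0.2, 0.9]),
}

labels, scores = [], []
for (_, (topics_1, emb_1)), (_, (topics_2, emb_2)) in combinations(repos.items(), 2):
    labels.append(share_topic(topics_1, topics_2))
    scores.append(cosine_similarity(emb_1, emb_2))

auc = roc_auc(labels, scores)
```

On this toy data every topic-sharing pair has a higher cosine similarity than every other pair, so `auc` is 1.0. The pairwise loop is quadratic in the number of repositories, which is one reason a real evaluation over thousands of repositories would likely be vectorised.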

app.py ADDED

@@ -0,0 +1,442 @@

+import re
+import json
+import nltk
+import joblib
+import torch
+import pandas as pd
+import numpy as np
+import streamlit as st
+from pathlib import Path
+from torch import nn
+from docarray import DocList
+from docarray.index import InMemoryExactNNIndex
+from transformers import pipeline
+from transformers import AutoTokenizer, AutoModel
+from data.repo_doc import RepoDoc
+from data.pair_classifier import PairClassifier
+from nltk.stem import WordNetLemmatizer
+nltk.download("wordnet")
+KMEANS_MODEL_PATH = Path(__file__).parent.joinpath("data/kmeans_model_scibert.pkl")
+SIMILARITY_CAL_MODEL_PATH = Path(__file__).parent.joinpath("data/SimilarityCal_model_NO1.pt")
+device = (
+    "cuda"
+    if torch.cuda.is_available()
+    else "mps"
+    if torch.backends.mps.is_available()
+    else "cpu"
+)
+# 1. Production environment
+# INDEX_PATH = Path(__file__).parent.joinpath("data/index.bin")
+# CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_clusters.json")
+SCIBERT_MODEL_PATH = "allenai/scibert_scivocab_uncased"
+# 2. Development environment
+INDEX_PATH = Path(__file__).parent.joinpath("data/index_test.bin")
+CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_clusters_test.json")
+# SCIBERT_MODEL_PATH = Path(__file__).parent.joinpath("data/scibert_scivocab_uncased")  # Download locally
+@st.cache_resource(show_spinner="Loading repositories basic information...")
+def load_index():
+    """
+    The function to load the index file and return a RepoDoc object with default value
+    :return: index and a RepoDoc object with default value
+    """
+    default_doc = RepoDoc(
+        name="",
+        topics=[],
+        stars=0,
+        license="",
+        code_embedding=None,
+        doc_embedding=None,
+        readme_embedding=None,
+        requirement_embedding=None,
+        repository_embedding=None
+    )
+    return InMemoryExactNNIndex[RepoDoc](index_file_path=INDEX_PATH), default_doc
+@st.cache_resource(show_spinner="Loading repositories clusters...")
+def load_repo_clusters():
+    """
+    The function to load the repo-clusters file
+    :return: a dictionary with the repo-clusters
+    """
+    with open(CLUSTER_PATH, "r") as file:
+        repo_clusters = json.load(file)
+    return repo_clusters
+@st.cache_resource(show_spinner="Loading RepoSim4Py pipeline model...")
+def load_pipeline_model():
+    """
+    The function to load RepoSim4Py pipeline model
+    :return: a HuggingFace pipeline
+    """
+    # Option 1 --- Download model by HuggingFace username/model_name
+    model_path = "Henry65/RepoSim4Py"
+    # Option 2 --- Download model locally
+    # model_path = Path(__file__).parent.joinpath("data/RepoSim4Py")
+    return pipeline(
+        model=model_path,
+        trust_remote_code=True,
+        device_map="auto"
+    )
+@st.cache_resource(show_spinner="Loading SciBERT model...")
+def load_scibert_model():
+    """
+    The function to load SciBERT model
+    :return: tokenizer and model
+    """
+    tokenizer = AutoTokenizer.from_pretrained(SCIBERT_MODEL_PATH)
+    scibert_model = AutoModel.from_pretrained(SCIBERT_MODEL_PATH).to(device)
+    return tokenizer, scibert_model
+@st.cache_resource(show_spinner="Loading KMeans model...")
+def load_kmeans_model():
+    """
+    The function to load KMeans model
+    :return: a KMeans model
+    """
+    return joblib.load(KMEANS_MODEL_PATH)
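+@st.cache_resource(show_spinner="Loading SimilarityCal model...")
+def load_similaritycal_model():
+    """
+    The function to load SimilarityCal model
+    :return: a PairClassifier model in evaluation mode
+    """
+    sim_cal_model = PairClassifier()
+    # map_location lets a checkpoint saved on another device load on the current one
+    sim_cal_model.load_state_dict(torch.load(SIMILARITY_CAL_MODEL_PATH, map_location=device))
+    sim_cal_model = sim_cal_model.to(device)
+    sim_cal_model = sim_cal_model.eval()
+    return sim_cal_model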
+def generate_scibert_embedding(tokenizer, scibert_model, text):
+    """
+    The function for generating SciBERT embeddings based on topic text
+    :param tokenizer: the SciBERT tokenizer
+    :param scibert_model: the SciBERT model
+    :param text: the topic text
+    :return: topic embeddings
+    """
+    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
+    # Inference only, so skip gradient tracking
+    with torch.no_grad():
+        outputs = scibert_model(**inputs)
+    # Use mean pooling for sentence representation
+    embeddings = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
+    return embeddings
+@st.cache_data(show_spinner=False)
+def run_pipeline_model(_model, repo_name, github_token):
+    """
+    The function to generate repo_info by using pipeline model
+    :param _model: pipeline
+    :param repo_name: the name of repository
+    :param github_token: GitHub token
+    :return: the information generated by the pipeline
+    """
+    with st.spinner(
+            f"Downloading and extracting the {repo_name}, this may take a while..."
+    ):
+        extracted_infos = _model.preprocess(repo_name, github_token=github_token)
+    if not extracted_infos:
+        return None
+    with st.spinner(f"Generating embeddings for {repo_name}..."):
+        repo_info = _model.forward(extracted_infos)[0]
+    return repo_info
+def run_index_search(index, query, search_field, limit):
+    """
+    The function to search at index file based on query and limit
+    :param index: the index
+    :param query: query
+    :param search_field: which field to search for
+    :param limit: page limit
+    :return: a dataframe with search results
+    """
+    top_matches, scores = index.find(
+        query=query, search_field=search_field, limit=limit
+    )
+    search_results = top_matches.to_dataframe()
+    search_results["scores"] = scores
+    return search_results
+def run_cluster_search(repo_clusters, repo_name_list):
+    """
+    The function to search cluster number for such repositories.
+    :param repo_clusters: dictionary with repo-clusters
+    :param repo_name_list: list or array represent repository names
+    :return: cluster number list
+    """
+    clusters = []
+    for repo_name in repo_name_list:
+        clusters.append(repo_clusters[repo_name])
+    return clusters
+def run_similaritycal_search(index, repo_clusters, model, query_doc, query_cluster_number, limit, same_cluster=True):
+    """
+    The function to run SimilarityCal model.
+    :param index: index file
+    :param repo_clusters: repo-clusters json file
+    :param model: SimilarityCal model
+    :param query_doc: query repo doc
+    :param query_cluster_number: query repo cluster number
+    :param limit: limit
+    :param same_cluster: whether searching for same cluster
+    :return: result dataframe
+    """
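+    docs = index._docs
+    input_embeddings_list = []
+    result_dl = DocList[RepoDoc]()
+    for doc in docs:
+        if same_cluster and query_cluster_number != repo_clusters[doc.name]:
+            continue
+        # Skip the query itself and documents without a repository embedding
+        if doc.name != query_doc.name and doc.repository_embedding is not None:
+            e1, e2 = (torch.Tensor(query_doc.repository_embedding),
+                      torch.Tensor(doc.repository_embedding))
+            input_embeddings = torch.cat([e1, e2])
+            input_embeddings_list.append(input_embeddings)
+            result_dl.append(doc)
+    # torch.stack raises on an empty list, so return an empty result early
+    if not input_embeddings_list:
+        return pd.DataFrame(columns=["name", "topics", "stars", "license", "scores"])
+    input_embeddings_list = torch.stack(input_embeddings_list).to(device)
+    softmax = nn.Softmax(dim=1).to(device)
+    # Inference only, so skip gradient tracking
+    with torch.no_grad():
+        model_output = model(input_embeddings_list)
+    # Column 1 of the softmax output is used as the similarity score
+    similarity_scores = softmax(model_output)[:, 1].cpu().numpy()
+    df = result_dl.to_dataframe()
+    df["scores"] = similarity_scores
+    return df.sort_values(by='scores', ascending=False).reset_index(drop=True).head(limit)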
+if __name__ == "__main__":
+    # Loading dataset and models
+    index, default_doc = load_index()
+    repo_clusters = load_repo_clusters()
+    pipeline_model = load_pipeline_model()
+    lemmatizer = WordNetLemmatizer()
+    tokenizer, scibert_model = load_scibert_model()
+    kmeans = load_kmeans_model()
+    sim_cal_model = load_similaritycal_model()
+    # Setting the sidebar
+    with st.sidebar:
+        st.text_input(
+            label="GitHub Token",
+            key="github_token",
+            type="password",
+            placeholder="Paste your GitHub token here",
+            help="Consider setting GitHub token to avoid hitting rate limits: https://docs.github.com/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token",
+        )
+        st.slider(
+            label="Search results limit",
+            min_value=1,
+            max_value=100,
+            value=10,
+            step=1,
+            key="search_results_limit",
+            help="Limit the number of search results",
+        )
+        st.multiselect(
+            label="Display columns",
+            options=["scores", "name", "topics", "cluster number", "stars", "license"],
+            default=["scores", "name", "topics", "cluster number", "stars", "license"],
+            help="Select columns to display in the search results",
+            key="display_columns",
+        )
+    # Setting the main content
+    st.title("RepoSnipy")
+    st.text_input(
+        "Enter a GitHub repository URL or owner/repository (case-sensitive):",
+        value="",
+        max_chars=200,
+        placeholder="numpy/numpy",
+        key="repo_input",
+    )
+    st.checkbox(
+        label="Add/Update this repository to the index",
+        value=False,
+        key="update_index",
+        help="Encode the latest version of this repository and add/update it to the index",
+    )
+    # Setting the search button
+    search = st.button("Search")
+    # The regular expression for repository
+    repo_regex = r"^((git@|http(s)?://)?(github\.com)(/|:))?(?P<owner>[\w.-]+)(/)(?P<repo>[\w.-]+?)(\.git)?(/)?$"
+    if search:
+        match_res = re.match(repo_regex, st.session_state.repo_input)
+        # 1. Repository can be matched
+        if match_res is not None:
+            repo_name = f"{match_res.group('owner')}/{match_res.group('repo')}"
+            records = index.filter({"name": {"$eq": repo_name}})
+            # 1) Building the query information
+            query_doc = default_doc.copy() if not records else records[0]
+            # 2) Recording the cluster number
+            cluster_number = -1 if not records else repo_clusters[repo_name]
+            # Important 1: cases where the repository information and cluster number must be (re)computed
+            if st.session_state.update_index or not records:
+                # 1) Updating repository information by using RepoSim4Py pipeline
+                repo_info = run_pipeline_model(pipeline_model, repo_name, st.session_state.github_token)
+                if repo_info is None:
+                    st.error("Repository not found or invalid GitHub token!")
+                    st.stop()
+                query_doc.name = repo_info["name"]
+                query_doc.topics = repo_info["topics"]
+                query_doc.stars = repo_info["stars"]
+                query_doc.license = repo_info["license"]
+                query_doc.code_embedding = None if np.all(repo_info["mean_code_embedding"] == 0) else repo_info[
+                    "mean_code_embedding"].reshape(-1)
+                query_doc.doc_embedding = None if np.all(repo_info["mean_doc_embedding"] == 0) else repo_info[
+                    "mean_doc_embedding"].reshape(-1)
+                query_doc.readme_embedding = None if np.all(repo_info["mean_readme_embedding"] == 0) else repo_info[
+                    "mean_readme_embedding"].reshape(-1)
+                query_doc.requirement_embedding = None if np.all(repo_info["mean_requirement_embedding"] == 0) else \
+                    repo_info["mean_requirement_embedding"].reshape(-1)
+                query_doc.repository_embedding = None if np.all(repo_info["mean_repo_embedding"] == 0) else repo_info[
+                    "mean_repo_embedding"].reshape(-1)
+                # 2) Updating cluster number
+                topics_text = ' '.join(
+                    [lemmatizer.lemmatize(topic.lower().replace('-', ' ')) for topic in query_doc.topics])
+                topic_embeddings = generate_scibert_embedding(tokenizer, scibert_model, topics_text)
+                cluster_number = int(kmeans.predict(topic_embeddings)[0])
+            # Important 2: update the index file and the repository clusters file
+            if st.session_state.update_index:
+                if not query_doc.license:
+                    st.warning(
+                        "License is missing in this repository and will not be persisted!"
+                    )
+                elif (query_doc.code_embedding is None) and (query_doc.doc_embedding is None) and (
+                        query_doc.requirement_embedding is None) and (query_doc.readme_embedding is None) and (
+                        query_doc.repository_embedding is None):
+                    st.warning(
+                        "This repository has no such useful information (code, docstring, readme and requirement) extracted and will not be persisted!"
+                    )
+                else:
+                    index.index(query_doc)
+                    repo_clusters[query_doc.name] = cluster_number
+                    with st.spinner("Persisting the index and repository clusters..."):
+                        index.persist(str(INDEX_PATH))
+                        with open(CLUSTER_PATH, "w") as file:
+                            json.dump(repo_clusters, file, indent=4)
+                        st.success("Repository updated to the index!")
+                    load_index.clear()
+                    load_repo_clusters.clear()
+            st.session_state["query_doc"] = query_doc
+            st.session_state["cluster_number"] = cluster_number
+        # 2. Repository cannot be matched
+        else:
+            st.error("Invalid input!")
+    # Starting to query
+    if "query_doc" in st.session_state:
+        query_doc = st.session_state.query_doc
+        cluster_number = st.session_state.cluster_number
+        limit = st.session_state.search_results_limit
+        # Showing the query repository information
+        st.dataframe(
+            pd.DataFrame(
+                [
+                    {
+                        "name": query_doc.name,
+                        "topics": query_doc.topics,
+                        "cluster number": cluster_number,
+                        "stars": query_doc.stars,
+                        "license": query_doc.license,
+                    }
+                ],
+            )
+        )
+        display_columns = st.session_state.display_columns
+        code_sim_tab, doc_sim_tab, readme_sim_tab, requirement_sim_tab, repo_sim_tab, same_cluster_tab, diff_cluster_tab = st.tabs(
+            ["Code_sim", "Docstring_sim", "Readme_sim", "Requirement_sim",
+             "Repository_sim", "Same_cluster", "Different_cluster"])
+        if query_doc.code_embedding is not None:
+            code_sim_res = run_index_search(index, query_doc, "code_embedding", limit)
+            cluster_numbers = run_cluster_search(repo_clusters, code_sim_res["name"])
+            code_sim_res["cluster number"] = cluster_numbers
+            code_sim_tab.dataframe(code_sim_res[display_columns])
+        else:
+            code_sim_tab.error("No function code was extracted for this repository!")
+        if query_doc.doc_embedding is not None:
+            doc_sim_res = run_index_search(index, query_doc, "doc_embedding", limit)
+            cluster_numbers = run_cluster_search(repo_clusters, doc_sim_res["name"])
+            doc_sim_res["cluster number"] = cluster_numbers
+            doc_sim_tab.dataframe(doc_sim_res[display_columns])
+        else:
+            doc_sim_tab.error("No function docstring was extracted for this repository!")
+        if query_doc.readme_embedding is not None:
+            readme_sim_res = run_index_search(index, query_doc, "readme_embedding", limit)
+            cluster_numbers = run_cluster_search(repo_clusters, readme_sim_res["name"])
+            readme_sim_res["cluster number"] = cluster_numbers
+            readme_sim_tab.dataframe(readme_sim_res[display_columns])
+        else:
+            readme_sim_tab.error("No readme file was extracted for this repository!")
+        if query_doc.requirement_embedding is not None:
+            requirement_sim_res = run_index_search(index, query_doc, "requirement_embedding", limit)
+            cluster_numbers = run_cluster_search(repo_clusters, requirement_sim_res["name"])
+            requirement_sim_res["cluster number"] = cluster_numbers
+            requirement_sim_tab.dataframe(requirement_sim_res[display_columns])
+        else:
+            requirement_sim_tab.error("No requirement file was extracted for this repository!")
+        if query_doc.repository_embedding is not None:
+            repo_sim_res = run_index_search(index, query_doc, "repository_embedding", limit)
+            cluster_numbers = run_cluster_search(repo_clusters, repo_sim_res["name"])
+            repo_sim_res["cluster number"] = cluster_numbers
+            repo_sim_tab.dataframe(repo_sim_res[display_columns])
+        else:
+            repo_sim_tab.error("No such useful information was extracted for this repository!")
+        if cluster_number is not None and query_doc.repository_embedding is not None:
+            same_cluster_df = run_similaritycal_search(index, repo_clusters, sim_cal_model,
+                                                       query_doc, cluster_number, limit,
+                                                       same_cluster=True)
+            diff_cluster_df = run_similaritycal_search(index, repo_clusters, sim_cal_model,
+                                                       query_doc, cluster_number, limit,
+                                                       same_cluster=False)
+            same_cluster_numbers = run_cluster_search(repo_clusters, same_cluster_df["name"])
+            same_cluster_df["cluster number"] = same_cluster_numbers
+            diff_cluster_numbers = run_cluster_search(repo_clusters, diff_cluster_df["name"])
+            diff_cluster_df["cluster number"] = diff_cluster_numbers
+            same_cluster_tab.dataframe(same_cluster_df[display_columns])
+            diff_cluster_tab.dataframe(diff_cluster_df[display_columns])
+        else:
+            same_cluster_tab.error("No such useful information was extracted for this repository!")
+            diff_cluster_tab.error("No such useful information was extracted for this repository!")
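The input handling in `app.py` relies on a single regular expression to accept a full GitHub URL, an SSH-style address, or a bare `owner/repository` string. The sketch below reuses that pattern verbatim and wraps it in a hypothetical helper, `normalize_repo_name`, to show which inputs are accepted; the helper name is an illustration and does not exist in the app.

```python
import re

# Pattern copied from app.py: an optional GitHub prefix, then owner/repo,
# optionally followed by ".git" and/or a trailing slash.
REPO_REGEX = r"^((git@|http(s)?://)?(github\.com)(/|:))?(?P<owner>[\w.-]+)(/)(?P<repo>[\w.-]+?)(\.git)?(/)?$"


def normalize_repo_name(text):
    """Return 'owner/repo' for a recognised input, or None otherwise."""
    match = re.match(REPO_REGEX, text)
    if match is None:
        return None
    return f"{match.group('owner')}/{match.group('repo')}"


examples = [
    "numpy/numpy",
    "https://github.com/numpy/numpy.git",
    "git@github.com:numpy/numpy.git",
    "https://github.com/numpy/numpy/",
    "not a repo",
]
results = {text: normalize_repo_name(text) for text in examples}
```

The first four inputs all normalise to `numpy/numpy`, while the last one yields `None`, which corresponds to the "Invalid input!" branch in the app. Because the `repo` group is lazy, a trailing `.git` is stripped rather than absorbed into the repository name.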

assets/search.gif ADDED

Git LFS Details

SHA256: 98ca3ea97923fb15842bef8278d55e9255b36750b03f234c649f93ea06ea7842
Pointer size: 132 Bytes
Size of remote file: 6.07 MB

data/SimilarityCal_model_NO1.pt ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9146d0736261db38bb6fe6d4d6dd17797c01980be23b114af4b86a18589af632
+size 102423158

data/index.bin ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3837b4cb3f10cd0ff035201ef44ab655608b2877e5c89efc5cc63a69b666c415
+size 226172318

data/index_test.bin ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3837b4cb3f10cd0ff035201ef44ab655608b2877e5c89efc5cc63a69b666c415
+size 226172318

data/kmeans_model_scibert.pkl ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7b561ee3342b0b8646533e6b7ffd451234d76ce3695862fd17fad18787a3b47c
+size 967215
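In `app.py`, this pickled KMeans model receives a mean-pooled SciBERT embedding of the repository topics and returns a cluster number. Predicting with a fitted KMeans model amounts to picking the nearest centroid, so the two steps can be sketched in pure Python. The three-dimensional vectors and the two centroids below are made-up stand-ins for the 768-dimensional SciBERT states and the real fitted centroids.

```python
def mean_pool(token_vectors):
    """Average token vectors column-wise, as mean(dim=1) does for a single sequence."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def nearest_centroid(vector, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to `vector`."""
    def squared_distance(centroid):
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i]))


# Hypothetical token states for a two-token topic string
token_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Hypothetical cluster centres
centroids = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]

topic_embedding = mean_pool(token_vectors)
cluster_number = nearest_centroid(topic_embedding, centroids)
```

Here the pooled embedding coincides with the second centroid, so the repository is assigned to cluster 1. Note that the app pools over all token positions without an attention mask, which is equivalent to this sketch only when a single unpadded sequence is encoded, as is the case there.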

data/pair_classifier.py ADDED

@@ -0,0 +1,37 @@

+import torch
+from torch import nn
+class EmbeddingMLP(nn.Module):
+    def __init__(self, size=4):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(768 * size, 900 * size),
+            nn.BatchNorm1d(900 * size),
+            nn.ReLU(),
+            nn.Linear(900 * size, 300 * size)
+        )
+    def forward(self, data):
+        res = self.net(data)
+        return res
+class PairClassifier(nn.Module):
+    def __init__(self, size=4):
+        super().__init__()
+        self.encoder = EmbeddingMLP(size)
+        self.net = nn.Sequential(
+            nn.Linear(300 * size * 2, 3000),
+            nn.ReLU(),
+            nn.Linear(3000, 1000),
+            nn.ReLU(),
+            nn.Linear(1000, 2),
+        )
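+    def forward(self, data):
+        # Split the concatenated pair in half instead of hard-coding 768 * 4,
+        # so the slicing stays correct when size != 4
+        half = data.shape[1] // 2
+        e1 = self.encoder(data[:, :half])
+        e2 = self.encoder(data[:, half:])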
+        twins = torch.cat([e1, e2], dim=1)
+        res = self.net(twins)
+        return res
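To make the layer sizes in `PairClassifier` easier to follow, the sketch below recomputes the `(in_features, out_features)` pair of every `nn.Linear` layer from the `size` argument, and shows how `app.py` turns the two output logits into a similarity score (softmax, then the probability at index 1). It is plain Python for illustration only: `pair_classifier_shapes` and `similarity_score` are hypothetical helpers, not part of the repository.

```python
from math import exp


def pair_classifier_shapes(size=4, embedding_dim=768):
    """Return the Linear layer shapes of the encoder and the classification head."""
    encoder = [
        (embedding_dim * size, 900 * size),  # first encoder layer
        (900 * size, 300 * size),            # second encoder layer
    ]
    head = [
        (300 * size * 2, 3000),  # the two encoded repositories are concatenated
        (3000, 1000),
        (1000, 2),               # two logits per pair
    ]
    # A model input row is the two repository embeddings concatenated
    return {"input": embedding_dim * size * 2, "encoder": encoder, "head": head}


def similarity_score(logits):
    """Softmax probability at index 1 for a two-logit output."""
    largest = max(logits)
    exps = [exp(x - largest) for x in logits]  # shift for numerical stability
    return exps[1] / sum(exps)


shapes = pair_classifier_shapes()
score = similarity_score([0.0, 0.0])
```

With the default `size=4`, each repository embedding has 3072 values, a model input row has 6144, and the encoder maps each half to 1200 values before the head sees 2400. Equal logits give a score of 0.5.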

data/repo_clusters.json ADDED

The diff for this file is too large to render. See raw diff
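Although the file itself is not rendered, the README and `app.py` indicate that it is a flat JSON object mapping each repository name to an integer cluster number. The sketch below shows that shape and a same-cluster lookup; the three entries are invented for illustration, and `same_cluster_repos` is a hypothetical helper rather than a function from the app.

```python
import json

# Hypothetical excerpt; the real file maps every indexed repository to a cluster number.
raw = '{"numpy/numpy": 3, "scipy/scipy": 3, "pallets/flask": 7}'
repo_clusters = json.loads(raw)


def same_cluster_repos(repo_clusters, query_name):
    """Names of other repositories in the query's cluster (empty if the query is unknown)."""
    query_cluster = repo_clusters.get(query_name)
    if query_cluster is None:
        return []
    return sorted(
        name
        for name, cluster in repo_clusters.items()
        if cluster == query_cluster and name != query_name
    )


neighbours = same_cluster_repos(repo_clusters, "numpy/numpy")
unknown = same_cluster_repos(repo_clusters, "someone/missing")
```

Using `dict.get` here avoids the `KeyError` that a direct `repo_clusters[name]` lookup raises for a repository that has not been clustered yet.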

data/repo_clusters_test.json ADDED

The diff for this file is too large to render. See raw diff

data/repo_doc.py ADDED

@@ -0,0 +1,18 @@

+from typing import List, Optional
+from docarray import BaseDoc
+from docarray.typing import NdArray
+class RepoDoc(BaseDoc):
+    """
+    The class for representing basic data structures.
+    """
+    name: str
+    topics: List[str]
+    stars: int
+    license: str
+    code_embedding: Optional[NdArray[768]]
+    doc_embedding: Optional[NdArray[768]]
+    readme_embedding: Optional[NdArray[768]]
+    requirement_embedding: Optional[NdArray[768]]
+    repository_embedding: Optional[NdArray[3072]]
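Every embedding field in `RepoDoc` is optional because `app.py` stores `None` whenever the pipeline returns an all-zero vector, and each search tab is only populated when the corresponding field is present. The following standard-library sketch mirrors that convention without docarray. `RepoRecord`, `none_if_all_zero`, and `searchable_fields` are hypothetical names, and only two of the five embedding fields are reproduced to keep it short.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


def none_if_all_zero(vector: Sequence[float]) -> Optional[List[float]]:
    """Mirror app.py's rule: an all-zero embedding means nothing was extracted."""
    return None if all(v == 0 for v in vector) else list(vector)


@dataclass
class RepoRecord:  # hypothetical stand-in for RepoDoc
    name: str
    topics: List[str] = field(default_factory=list)
    stars: int = 0
    license: str = ""
    code_embedding: Optional[List[float]] = None        # 768 values in RepoDoc
    repository_embedding: Optional[List[float]] = None  # 3072 values in RepoDoc

    def searchable_fields(self) -> List[str]:
        """Embedding fields that are present and can therefore be queried."""
        candidates = ("code_embedding", "repository_embedding")
        return [name for name in candidates if getattr(self, name) is not None]


record = RepoRecord(
    name="owner/repo",
    code_embedding=none_if_all_zero([0.0] * 768),         # nothing extracted
    repository_embedding=none_if_all_zero([0.1] * 3072),  # usable embedding
)
fields = record.searchable_fields()
```

In this example only `repository_embedding` is searchable, which matches the app's behaviour of showing an error in the code tab while still filling the repository tab.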

requirements.txt ADDED

@@ -0,0 +1,12 @@

+accelerate
+docarray
+pandas
+numpy
+streamlit
+torch
+transformers
+tqdm
+scikit-learn
+nltk
+plotly
+joblib