HenryStephen committed • Commit c6a1f8c • Parent(s): c4d8fc9

topic cluster and code cluster

Browse files:
- README.md +12 -6
- app.py +154 -80
- assets/Repository-Code Cluster Assignments.png +0 -0
- assets/Repository-Topic Cluster Assignments.png +0 -0
- common/__init__.py +0 -0
- {data → common}/pair_classifier.py +0 -0
- {data → common}/repo_doc.py +0 -0
- data/{kmeans_model_scibert.pkl → kmeans_model_code_unixcoder.pkl} +1 -1
- data/kmeans_model_topic_scibert.pkl +3 -0
- data/repo_code_clusters.json +0 -0
- data/repo_code_clusters_test.json +0 -0
- data/{repo_clusters.json → repo_topic_clusters.json} +0 -0
- data/{repo_clusters_test.json → repo_topic_clusters_test.json} +0 -0
- requirements.txt +2 -1
README.md CHANGED

@@ -27,18 +27,19 @@ RepoSnipy is a neural search engine built with [streamlit](https://github.com/st…
 
 Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalysis/RepoSnipy), the latest version has the following new features:
 * It uses [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py), which is based on the [RepoSim4Py pipeline](https://huggingface.co/Henry65/RepoSim4Py), to create multi-level embeddings for Python repositories.
-* Multi-level embeddings --- code, …
+* Multi-level embeddings --- code, doc, readme, requirement, and repository.
 * It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate embeddings for topics.
-* Transfer multiple topics into one cluster --- it uses a …
-* …
+* Transferring multiple topics into one cluster --- it uses a KMeans model ([kmeans_model_topic_scibert](data/kmeans_model_topic_scibert.pkl)) to analyse topic embeddings and to cluster repositories based on topics.
+* Clustering by code snippets --- it uses a KMeans model ([kmeans_model_code_unixcoder](data/kmeans_model_code_unixcoder.pkl)) to analyse code embeddings and to cluster repositories based on code snippets.
+* It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates cluster similarity based on repository-level embeddings and cluster numbers (topic or code).
 More generally, the SimilarityCal model treats two repositories in the same cluster as label 1, and otherwise as label 0. Its input features are the concatenation of the two repositories' embeddings, with the binary labels as described above.
 Its outputs are scores for how similar or dissimilar two repositories are.
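To make the pairing concrete, here is a minimal sketch of a SimilarityCal-style scorer. Only the input convention (two repository-level embeddings concatenated) and the softmax over two logits follow the description above; the `PairClassifier` layers below are assumptions for illustration, not the exact network in common/pair_classifier.py.

```python
# Minimal sketch of SimilarityCal-style pair scoring. The layer sizes are
# assumed for illustration; only the concatenated-embedding input and the
# two-logit softmax output follow the description above.
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    def __init__(self, embedding_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim * 2, 256),  # input: two embeddings concatenated
            nn.ReLU(),
            nn.Linear(256, 2),  # logits for [different cluster, same cluster]
        )

    def forward(self, pairs: torch.Tensor) -> torch.Tensor:
        return self.net(pairs)

model = PairClassifier().eval()
e1 = torch.randn(768)  # repository-level embedding of repo A (made up)
e2 = torch.randn(768)  # repository-level embedding of repo B (made up)
with torch.no_grad():
    logits = model(torch.cat([e1, e2]).unsqueeze(0))
score = torch.softmax(logits, dim=1)[0, 1].item()  # probability of "similar"
print(f"similarity score: {score:.3f}")
```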
 
+We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9,700 GitHub Python repositories that have a license and over 300 stars as of March 2024. The clusters generated for them were put into two JSON datasets ([repo_topic_clusters](data/repo_topic_clusters.json) and [repo_code_clusters](data/repo_code_clusters.json)), each storing repo-cluster pairs as key-value entries.
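A minimal sketch of how a repository's topics could be mapped to a topic cluster with the bundled files; the preprocessing mirrors app.py, while the topics list itself is made up.

```python
# Sketch: assign a repository's topic cluster with the bundled models.
# The topics list is made up; the preprocessing mirrors app.py.
import joblib
import nltk
import torch
from nltk.stem import WordNetLemmatizer
from transformers import AutoTokenizer, AutoModel

nltk.download("wordnet")
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
scibert = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
topic_kmeans = joblib.load("data/kmeans_model_topic_scibert.pkl")

lemmatizer = WordNetLemmatizer()
topics = ["machine-learning", "code-search"]  # made-up example topics
topics_text = " ".join(lemmatizer.lemmatize(t.lower().replace("-", " ")) for t in topics)

inputs = tokenizer(topics_text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = scibert(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1).numpy()  # mean pooling
print("topic cluster:", int(topic_kmeans.predict(embedding)[0]))
```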
 
 ## Dataset
+As mentioned above, RepoSnipy needs the [vector dataset](data/index.bin), the cluster JSON datasets ([repo_topic_clusters](data/repo_topic_clusters.json) and [repo_code_clusters](data/repo_code_clusters.json)), the KMeans models ([kmeans_model_topic_scibert](data/kmeans_model_topic_scibert.pkl) and [kmeans_model_code_unixcoder](data/kmeans_model_code_unixcoder.pkl)), and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model when it starts up. For your convenience, we have uploaded them to the [data](data) folder of this repository.
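A condensed sketch of that startup loading, assuming the data/ layout above (app.py wraps the same steps in st.cache_resource loaders):

```python
# Sketch: load everything RepoSnipy needs at startup, using the file layout
# shipped in data/ (paths follow app.py's "product environment" settings).
import json
import joblib
import torch
from docarray.index import InMemoryExactNNIndex
from common.repo_doc import RepoDoc  # the document schema used by the index

index = InMemoryExactNNIndex[RepoDoc](index_file_path="data/index.bin")

with open("data/repo_topic_clusters.json", "r") as f:
    repo_topic_clusters = json.load(f)  # repo name -> topic cluster number
with open("data/repo_code_clusters.json", "r") as f:
    repo_code_clusters = json.load(f)   # repo name -> code cluster number

topic_kmeans = joblib.load("data/kmeans_model_topic_scibert.pkl")
code_kmeans = joblib.load("data/kmeans_model_code_unixcoder.pkl")

# SimilarityCal_model_NO1.pt holds a state dict for the pair classifier
state_dict = torch.load("data/SimilarityCal_model_NO1.pt", map_location="cpu")
```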
 
 ## License

@@ -51,4 +52,9 @@ The model and the fine-tuning dataset used:
 * [UniXCoder](https://arxiv.org/abs/2203.03850)
 * [AdvTest](https://arxiv.org/abs/1909.09436)
-* [SciBERT](https://arxiv.org/abs/1903.10676)
+* [SciBERT](https://arxiv.org/abs/1903.10676)
+* [RepoSnipy (old version)](https://github.com/RepoAnalysis/RepoSnipy)
+* [RepoSnipy HuggingFace Spaces (old version)](https://huggingface.co/spaces/Lazyhope/RepoSnipy)
+* [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py)
+* [SimilarityCal](https://github.com/RepoMining/SimilarityCal)
+* [RepoSnipy](https://github.com/RepoMining/RepoSnipy)
app.py CHANGED

@@ -12,13 +12,16 @@ from docarray import DocList
 from docarray.index import InMemoryExactNNIndex
 from transformers import pipeline
 from transformers import AutoTokenizer, AutoModel
+from common.repo_doc import RepoDoc
+from common.pair_classifier import PairClassifier
 from nltk.stem import WordNetLemmatizer
 
 nltk.download("wordnet")
+KMEANS_TOPIC_MODEL_PATH = Path(__file__).parent.joinpath("data/kmeans_model_topic_scibert.pkl")
+KMEANS_CODE_MODEL_PATH = Path(__file__).parent.joinpath("data/kmeans_model_code_unixcoder.pkl")
 SIMILARITY_CAL_MODEL_PATH = Path(__file__).parent.joinpath("data/SimilarityCal_model_NO1.pt")
+SCIBERT_MODEL_PATH = "allenai/scibert_scivocab_uncased"
+# SCIBERT_MODEL_PATH = Path(__file__).parent.joinpath("data/scibert_scivocab_uncased")  # Downloaded locally
 device = (
     "cuda"
     if torch.cuda.is_available()
@@ -29,14 +32,13 @@ device = (
 
 # 1. Product environment
 # INDEX_PATH = Path(__file__).parent.joinpath("data/index.bin")
+# TOPIC_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_topic_clusters.json")
+# CODE_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_code_clusters.json")
 
 # 2. Developing environment
 INDEX_PATH = Path(__file__).parent.joinpath("data/index_test.bin")
+TOPIC_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_topic_clusters_test.json")
+CODE_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_code_clusters_test.json")
 
 
 @st.cache_resource(show_spinner="Loading repositories basic information...")
@@ -60,16 +62,28 @@ def load_index():
     return InMemoryExactNNIndex[RepoDoc](index_file_path=INDEX_PATH), default_doc
 
 
+@st.cache_resource(show_spinner="Loading repositories topic clusters...")
+def load_repo_topic_clusters():
+    """
+    The function to load the repo-topic_clusters file
+    :return: a dictionary with the repo-topic_clusters
+    """
+    with open(TOPIC_CLUSTER_PATH, "r") as file:
+        repo_topic_clusters = json.load(file)
+
+    return repo_topic_clusters
+
+
+@st.cache_resource(show_spinner="Loading repositories code clusters...")
+def load_repo_code_clusters():
     """
+    The function to load the repo-code_clusters file
+    :return: a dictionary with the repo-code_clusters
     """
+    with open(CODE_CLUSTER_PATH, "r") as file:
+        repo_code_clusters = json.load(file)
 
+    return repo_code_clusters
 
 
 @st.cache_resource(show_spinner="Loading RepoSim4Py pipeline model...")
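For reference, the clusters files read by the two loaders above are plain repo-to-cluster dictionaries; a hypothetical round-trip, with made-up repository names and cluster numbers:

```python
# Hypothetical contents of a repo-clusters JSON file: full repo name -> cluster id.
# The names and numbers are made up; the shape matches how app.py reads these
# files (json.load above) and writes them back (json.dump(..., indent=4)).
import json

repo_topic_clusters = {
    "RepoMining/RepoSim4Py": 3,
    "RepoAnalysis/RepoSnipy": 3,
    "streamlit/streamlit": 7,
}

with open("repo_topic_clusters_demo.json", "w") as file:
    json.dump(repo_topic_clusters, file, indent=4)

with open("repo_topic_clusters_demo.json", "r") as file:
    assert json.load(file)["streamlit/streamlit"] == 7
```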
@@ -99,16 +113,26 @@ def load_scibert_model():
     """
     tokenizer = AutoTokenizer.from_pretrained(SCIBERT_MODEL_PATH)
     scibert_model = AutoModel.from_pretrained(SCIBERT_MODEL_PATH).to(device)
+
     return tokenizer, scibert_model
 
 
+@st.cache_resource(show_spinner="Loading KMeans model (topic clusters)...")
+def load_topic_kmeans_model():
+    """
+    The function to load KMeans model (topic clusters)
+    :return: a KMeans model (topic clusters)
+    """
+    return joblib.load(KMEANS_TOPIC_MODEL_PATH)
+
+
+@st.cache_resource(show_spinner="Loading KMeans model (code clusters)...")
+def load_code_kmeans_model():
     """
+    The function to load KMeans model (code clusters)
+    :return: a KMeans model (code clusters)
     """
+    return joblib.load(KMEANS_CODE_MODEL_PATH)
 
 
 @st.cache_resource(show_spinner="Loading SimilarityCal model...")
@@ -117,6 +141,7 @@ def load_similaritycal_model():
     sim_cal_model.load_state_dict(torch.load(SIMILARITY_CAL_MODEL_PATH, map_location=device))
     sim_cal_model = sim_cal_model.to(device)
    sim_cal_model = sim_cal_model.eval()
+
     return sim_cal_model
@@ -130,6 +155,7 @@ def generate_scibert_embedding(tokenizer, scibert_model, text):
     outputs = scibert_model(**inputs)
     # Use mean pooling for sentence representation
     embeddings = outputs.last_hidden_state.mean(dim=1).cpu().detach().numpy()
+
     return embeddings
@@ -150,8 +176,10 @@ def run_pipeline_model(_model, repo_name, github_token):
     if not extracted_infos:
         return None
 
+    st_progress_bar = st.progress(0.0)
     with st.spinner(f"Generating embeddings for {repo_name}..."):
-        repo_info = _model.forward(extracted_infos)[0]
+        repo_info = _model.forward(extracted_infos, st_progress=st_progress_bar)[0]
+    st_progress_bar.empty()
 
     return repo_info
@@ -175,36 +203,50 @@ def run_index_search(index, query, search_field, limit):
     return search_results
 
 
+def run_topic_cluster_search(repo_topic_clusters, repo_name_list):
+    """
+    The function to search the topic cluster number for the given repositories.
+    :param repo_topic_clusters: dictionary with the repo-topic_clusters
+    :param repo_name_list: list or array of repository names
+    :return: topic cluster number list
+    """
+    topic_clusters = []
+    for repo_name in repo_name_list:
+        topic_clusters.append(repo_topic_clusters[repo_name])
+
+    return topic_clusters
+
+
+def run_code_cluster_search(repo_code_clusters, repo_name_list):
     """
+    The function to search the code cluster number for the given repositories.
+    :param repo_code_clusters: dictionary with the repo-code_clusters
     :param repo_name_list: list or array of repository names
+    :return: code cluster number list
     """
+    code_clusters = []
     for repo_name in repo_name_list:
+        code_clusters.append(repo_code_clusters[repo_name])
+
+    return code_clusters
+def run_similaritycal_search(index, repo_clusters, model, query_doc, query_cluster_number, limit):
     """
     The function to run SimilarityCal model.
     :param index: index file
-    :param repo_clusters: repo-clusters json file
+    :param repo_clusters: repo-clusters (topic_cluster or code_cluster) json file
     :param model: SimilarityCal model
     :param query_doc: query repo doc
-    :param query_cluster_number: query repo cluster number
+    :param query_cluster_number: query repo cluster number (code or topic)
     :param limit: limit
-    :param same_cluster: whether searching for same cluster
     :return: result dataframe
     """
     docs = index._docs
     input_embeddings_list = []
     result_dl = DocList[RepoDoc]()
     for doc in docs:
+        if query_cluster_number != repo_clusters[doc.name]:
             continue
         if doc.name != query_doc.name:
             e1, e2 = (torch.Tensor(query_doc.repository_embedding),

@@ -219,17 +261,24 @@ def run_similaritycal_search(index, repo_clusters, model, query_doc, query_clust…
     similarity_scores = softmax(model_output)[:, 1].cpu().detach().numpy()
     df = result_dl.to_dataframe()
     df["scores"] = similarity_scores
+
+    sorted_df = df.sort_values(by='scores', ascending=False).reset_index(drop=True).head(limit)
+    sorted_df["rankings"] = sorted_df["scores"].rank(ascending=False).astype(int)
+    sorted_df.drop(columns="scores", inplace=True)
+
+    return sorted_df
 
 if __name__ == "__main__":
     # Loading dataset and models
     index, default_doc = load_index()
+    repo_topic_clusters = load_repo_topic_clusters()
+    repo_code_clusters = load_repo_code_clusters()
     pipeline_model = load_pipeline_model()
     lemmatizer = WordNetLemmatizer()
     tokenizer, scibert_model = load_scibert_model()
+    topic_kmeans = load_topic_kmeans_model()
+    code_kmeans = load_code_kmeans_model()
     sim_cal_model = load_similaritycal_model()
 
     # Setting the sidebar

@@ -254,8 +303,8 @@ if __name__ == "__main__":
 
     st.multiselect(
         label="Display columns",
+        options=["scores", "name", "topics", "code cluster", "topic cluster", "stars", "license"],
+        default=["scores", "name", "topics", "code cluster", "topic cluster", "stars", "license"],
         help="Select columns to display in the search results",
         key="display_columns",
     )
@@ -291,10 +340,11 @@ if __name__ == "__main__":
     records = index.filter({"name": {"$eq": repo_name}})
     # 1) Building the query information
     query_doc = default_doc.copy() if not records else records[0]
-    # 2) Recording the cluster …
+    # 2) Recording the topic and code cluster numbers
+    topic_cluster_number = -1 if not records else repo_topic_clusters[repo_name]
+    code_cluster_number = -1 if not records else repo_code_clusters[repo_name]
 
-    # Importance 1 ---- situation need to update repository information and cluster …
+    # Importance 1 ---- situations that need updating the repository information and cluster numbers
     if st.session_state.update_index or not records:
         # 1) Updating repository information by using RepoSim4Py pipeline
         repo_info = run_pipeline_model(pipeline_model, repo_name, st.session_state.github_token)

@@ -317,13 +367,18 @@ if __name__ == "__main__":
         query_doc.repository_embedding = None if np.all(repo_info["mean_repo_embedding"] == 0) else repo_info[
             "mean_repo_embedding"].reshape(-1)
 
-        # 2) Updating cluster number
+        # 2) Updating topic cluster number
         topics_text = ' '.join(
             [lemmatizer.lemmatize(topic.lower().replace('-', ' ')) for topic in query_doc.topics])
         topic_embeddings = generate_scibert_embedding(tokenizer, scibert_model, topics_text)
+        topic_cluster_number = int(topic_kmeans.predict(topic_embeddings)[0])
+
+        # 3) Updating code cluster number
+        code_embeddings = np.zeros((768,),
+                                   dtype=np.float32) if query_doc.code_embedding is None else query_doc.code_embedding
+        code_cluster_number = int(code_kmeans.predict(code_embeddings.reshape(1, -1))[0])
 
-    # Importance 2 ---- update index file and repository clusters
+    # Importance 2 ---- update the index file and the repository clusters (topic and code) files
     if st.session_state.update_index:
         if not query_doc.license:
             st.warning(

@@ -337,19 +392,24 @@ if __name__ == "__main__":
             )
         else:
             index.index(query_doc)
+            repo_topic_clusters[query_doc.name] = topic_cluster_number
+            repo_code_clusters[query_doc.name] = code_cluster_number
 
-            with st.spinner("Persisting the index and repository clusters..."):
+            with st.spinner("Persisting the index and repository clusters (topic and code)..."):
                 index.persist(str(INDEX_PATH))
+                with open(TOPIC_CLUSTER_PATH, "w") as file:
+                    json.dump(repo_topic_clusters, file, indent=4)
+                with open(CODE_CLUSTER_PATH, "w") as file:
+                    json.dump(repo_code_clusters, file, indent=4)
             st.success("Repository updated to the index!")
 
             load_index.clear()
+            load_repo_topic_clusters.clear()
+            load_repo_code_clusters.clear()
 
         st.session_state["query_doc"] = query_doc
+        st.session_state["topic_cluster_number"] = topic_cluster_number
+        st.session_state["code_cluster_number"] = code_cluster_number
 
     # 2. Repository cannot be matched
     else:
@@ -358,7 +418,8 @@ if __name__ == "__main__":
     # Starting to query
     if "query_doc" in st.session_state:
         query_doc = st.session_state.query_doc
+        topic_cluster_number = st.session_state.topic_cluster_number
+        code_cluster_number = st.session_state.code_cluster_number
         limit = st.session_state.search_results_limit
 
         # Showing the query repository information

@@ -368,7 +429,8 @@ if __name__ == "__main__":
             {
                 "name": query_doc.name,
                 "topics": query_doc.topics,
+                "topic cluster": topic_cluster_number,
+                "code cluster": code_cluster_number,
                 "stars": query_doc.stars,
                 "license": query_doc.license,
             }

@@ -377,15 +439,18 @@ if __name__ == "__main__":
         )
 
         display_columns = st.session_state.display_columns
+        modified_display_columns = ["rankings" if col == "scores" else col for col in display_columns]
+        code_sim_tab, doc_sim_tab, readme_sim_tab, requirement_sim_tab, repo_sim_tab, code_cluster_tab, topic_cluster_tab = st.tabs(
             ["Code_sim", "Docstring_sim", "Readme_sim", "Requirement_sim",
+             "Repository_sim", "Code_cluster_sim", "Topic_cluster_sim"])
 
         with code_sim_tab:
             if query_doc.code_embedding is not None:
                 code_sim_res = run_index_search(index, query_doc, "code_embedding", limit)
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, code_sim_res["name"])
+                code_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, code_sim_res["name"])
+                code_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(code_sim_res[display_columns])
             else:
                 st.error("No function code was extracted for this repository!")

@@ -393,8 +458,10 @@ if __name__ == "__main__":
         with doc_sim_tab:
             if query_doc.doc_embedding is not None:
                 doc_sim_res = run_index_search(index, query_doc, "doc_embedding", limit)
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, doc_sim_res["name"])
+                doc_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, doc_sim_res["name"])
+                doc_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(doc_sim_res[display_columns])
             else:
                 st.error("No function docstring was extracted for this repository!")

@@ -402,8 +469,10 @@ if __name__ == "__main__":
         with readme_sim_tab:
             if query_doc.readme_embedding is not None:
                 readme_sim_res = run_index_search(index, query_doc, "readme_embedding", limit)
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, readme_sim_res["name"])
+                readme_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, readme_sim_res["name"])
+                readme_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(readme_sim_res[display_columns])
             else:
                 st.error("No readme file was extracted for this repository!")

@@ -411,8 +480,10 @@ if __name__ == "__main__":
         with requirement_sim_tab:
             if query_doc.requirement_embedding is not None:
                 requirement_sim_res = run_index_search(index, query_doc, "requirement_embedding", limit)
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, requirement_sim_res["name"])
+                requirement_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, requirement_sim_res["name"])
+                requirement_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(requirement_sim_res[display_columns])
             else:
                 st.error("No requirement file was extracted for this repository!")

@@ -421,31 +492,34 @@ if __name__ == "__main__":
             if query_doc.repository_embedding is not None:
                 # Repo Sim tab
                 repo_sim_res = run_index_search(index, query_doc, "repository_embedding", limit)
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, repo_sim_res["name"])
+                repo_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, repo_sim_res["name"])
+                repo_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(repo_sim_res[display_columns])
             else:
                 st.error("No such useful information was extracted for this repository!")
 
+        with code_cluster_tab:
             if query_doc.repository_embedding is not None:
+                cluster_df = run_similaritycal_search(index, repo_code_clusters, sim_cal_model,
+                                                      query_doc, code_cluster_number, limit)
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, cluster_df["name"])
+                cluster_df["code cluster"] = code_cluster_numbers
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, cluster_df["name"])
+                cluster_df["topic cluster"] = topic_cluster_numbers
+                st.dataframe(cluster_df[modified_display_columns])
             else:
                 st.error("No such useful information was extracted for this repository!")
 
+        with topic_cluster_tab:
             if query_doc.repository_embedding is not None:
+                cluster_df = run_similaritycal_search(index, repo_topic_clusters, sim_cal_model,
+                                                      query_doc, topic_cluster_number, limit)
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, cluster_df["name"])
+                cluster_df["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, cluster_df["name"])
+                cluster_df["code cluster"] = code_cluster_numbers
+                st.dataframe(cluster_df[modified_display_columns])
             else:
+                topic_cluster_tab.error("No such useful information was extracted for this repository!")
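The post-processing at the end of run_similaritycal_search can be exercised on its own; a standalone pandas sketch with made-up scores:

```python
# Standalone sketch of run_similaritycal_search's post-processing: sort by
# similarity score, keep the top `limit` rows, then swap scores for ranks.
import pandas as pd

df = pd.DataFrame({"name": ["repo/a", "repo/b", "repo/c"],
                   "scores": [0.20, 0.90, 0.55]})  # made-up scores
limit = 2

sorted_df = df.sort_values(by="scores", ascending=False).reset_index(drop=True).head(limit)
sorted_df["rankings"] = sorted_df["scores"].rank(ascending=False).astype(int)
sorted_df.drop(columns="scores", inplace=True)
print(sorted_df)  # repo/b first (rank 1), repo/c second (rank 2)
```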
assets/Repository-Code Cluster Assignments.png ADDED

assets/Repository-Topic Cluster Assignments.png ADDED

common/__init__.py ADDED
File without changes

{data → common}/pair_classifier.py RENAMED
File without changes

{data → common}/repo_doc.py RENAMED
File without changes

data/{kmeans_model_scibert.pkl → kmeans_model_code_unixcoder.pkl} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:…
+oid sha256:bb534645bce9fb19975873003be27e0b386df7550693caed46ee0f1822b16533
 size 967215

data/kmeans_model_topic_scibert.pkl ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:48272b4172b3dba079348462044f72f19a004ff65d6cd9222ef424468261f1fb
+size 967215

data/repo_code_clusters.json ADDED
The diff for this file is too large to render. See raw diff.

data/repo_code_clusters_test.json ADDED
The diff for this file is too large to render. See raw diff.

data/{repo_clusters.json → repo_topic_clusters.json} RENAMED
File without changes

data/{repo_clusters_test.json → repo_topic_clusters_test.json} RENAMED
File without changes

requirements.txt CHANGED

@@ -9,4 +9,5 @@ tqdm
 scikit-learn
 nltk
 plotly
-joblib
+joblib
+matplotlib