Spaces: HUBioDataLab/ProtHGT

Erva Ulusoy committed on Mar 21

Commit 14c3500

1 Parent(s): 85b27f1

include second degree edges (major update)

Files changed (2)

ProtHGT_app.py +75 -41
visualize_kg.py +94 -4

ProtHGT_app.py CHANGED

@@ -70,7 +70,7 @@ with st.expander("🚀 Upcoming Features"):
     - **Real-time data retrieval for new proteins**: Currently, ProtHGT can only generate predictions for proteins that already exist in our knowledge graph. We are developing a new feature that will allow users to **predict functions for entirely new proteins starting from their sequences**. This will work by **retrieving relevant relationship data in real time from external source databases** (e.g., UniProt, STRING, and other biological repositories). The system will dynamically construct a knowledge graph for the query protein, incorporating its interactions, domains, pathways, and other biological associations before running function prediction. This approach will enable ProtHGT to analyze newly discovered or less-studied proteins even if they are not pre-annotated in our dataset.
     - **Expanded embedding options**: Currently, this application represents proteins using **TAPE embeddings**, which serve as the initial numerical representations of protein sequences before being processed in the heterogeneous graph model. We are working on integrating **ProtT5** and **ESM-2** as alternative initial embeddings, allowing users to choose different sequence representations that may enhance performance for specific tasks. A detailed comparison of how these embeddings influence function prediction accuracy will be included in our upcoming publication.
-    - **Knowledge graph visualization for interpretability**: To improve model explainability, we are developing an interactive **knowledge graph visualization** feature. This will allow users to explore the biological relationships that contributed to ProtHGT’s predictions for a given protein. Users will be able to inspect **protein interactions, GO annotations, domains, pathways, and other key connections** in a structured graphical format, making it easier to interpret and validate predictions.
     Stay tuned for updates and future publications!
     """)
@@ -562,78 +562,112 @@ if st.session_state.submitted:
                 # Create visualizations in each tab
                 for idx, protein_id in enumerate(selected_proteins):
                     with protein_tabs[idx]:
-                        max_node_count = st.slider(
-                            "Maximum neighbors per edge type",
-                            min_value=5,
-                            max_value=50,
-                            value=10,
-                            step=5,
-                            help="Control the maximum number of neighboring nodes shown for each relationship type",
-                            key=f"slider_{protein_id}"
-                        )
-                        # Check if visualization exists for this protein
                         viz_exists = (protein_id in st.session_state.protein_visualizations and
-                                    os.path.exists(st.session_state.protein_visualizations[protein_id]['path']))
                         if not viz_exists:
                             if st.button(f"Generate Visualization", key=f"viz_{protein_id}"):
-                                # Generate visualization with selected max_node_count
-                                html_path, visualized_edges = visualize_protein_subgraph(
                                     st.session_state.heterodata,
                                     protein_id,
                                     st.session_state.predictions_df,
-                                    limit=max_node_count
                                 )
-                                # Store visualization info in session state
-                                st.session_state.protein_visualizations[protein_id] = {
-                                    'path': html_path,
-                                    'edges': visualized_edges
                                 }
                                 st.rerun()
-                        # If visualization exists, display it
                         if viz_exists:
-                            viz_info = st.session_state.protein_visualizations[protein_id]
-                            # Add download button for edges
-                            formatted_edges = {}
-                            for edge_type, edges in viz_info['edges'].items():
-                                edge_type_str = f"{edge_type[0]}_{edge_type[1]}_{edge_type[2]}"
-                                formatted_edges[edge_type_str] = [
-                                    {"source": edge[0][0], "target": edge[0][1], "probability": edge[1]}
-                                    for edge in edges
-                                ]
                             kg_viz_button_columns = st.columns([1, 1, 1])
                             with kg_viz_button_columns[0]:
                                 st.download_button(
                                     label='Download Visualized Edges',
                                     data=json.dumps(formatted_edges, indent=2),
-                                    file_name=f'{protein_id}_visualized_edges.json',
                                     mime='application/json'
                                 )
                             with kg_viz_button_columns[1]:
                                 if st.button("Regenerate Visualization", key=f"regenerate_{protein_id}"):
-                                    # Clean up old file
-                                    try:
-                                        os.unlink(viz_info['path'])
-                                    except FileNotFoundError:
-                                        pass
-                                    # Remove from session state
-                                    del st.session_state.protein_visualizations[protein_id]
                                     st.rerun()
                             with open(viz_info['path'], 'r', encoding='utf-8') as f:
                                 html_content = f.read()
                             st.components.v1.html(html_content, height=1200)
             else:
                 st.warning("Knowledge graph visualization is only available when 10 or fewer proteins are selected.")

     - **Real-time data retrieval for new proteins**: Currently, ProtHGT can only generate predictions for proteins that already exist in our knowledge graph. We are developing a new feature that will allow users to **predict functions for entirely new proteins starting from their sequences**. This will work by **retrieving relevant relationship data in real time from external source databases** (e.g., UniProt, STRING, and other biological repositories). The system will dynamically construct a knowledge graph for the query protein, incorporating its interactions, domains, pathways, and other biological associations before running function prediction. This approach will enable ProtHGT to analyze newly discovered or less-studied proteins even if they are not pre-annotated in our dataset.
     - **Expanded embedding options**: Currently, this application represents proteins using **TAPE embeddings**, which serve as the initial numerical representations of protein sequences before being processed in the heterogeneous graph model. We are working on integrating **ProtT5** and **ESM-2** as alternative initial embeddings, allowing users to choose different sequence representations that may enhance performance for specific tasks. A detailed comparison of how these embeddings influence function prediction accuracy will be included in our upcoming publication.
+    - **Knowledge graph visualization for interpretability**: To improve model explainability, we are developing an interactive **knowledge graph visualization** feature. This will allow users to explore the biological relationships that contributed to ProtHGT's predictions for a given protein. Users will be able to inspect **protein interactions, GO annotations, domains, pathways, and other key connections** in a structured graphical format, making it easier to interpret and validate predictions.
     Stay tuned for updates and future publications!
     """)
                 # Create visualizations in each tab
                 for idx, protein_id in enumerate(selected_proteins):
                     with protein_tabs[idx]:
+                        col1, col2 = st.columns([3, 1])
+                        with col1:
+                            max_node_count = st.slider(
+                                "Maximum neighbors per edge type",
+                                min_value=5,
+                                max_value=50,
+                                value=10,
+                                step=5,
+                                help="Control the maximum number of neighboring nodes shown for each relationship type",
+                                key=f"slider_{protein_id}"
+                            )
+                        # Check if both visualizations exist for this protein
                         viz_exists = (protein_id in st.session_state.protein_visualizations and
+                                     'first_degree' in st.session_state.protein_visualizations[protein_id] and
+                                     'second_degree' in st.session_state.protein_visualizations[protein_id])
                         if not viz_exists:
                             if st.button(f"Generate Visualization", key=f"viz_{protein_id}"):
+                                # Initialize this protein's visualization entry if it does not exist yet
+                                if protein_id not in st.session_state.protein_visualizations:
+                                    st.session_state.protein_visualizations[protein_id] = {}
+                                # Generate both visualizations upfront
+                                # First degree only
+                                html_path_1st, edges_1st = visualize_protein_subgraph(
                                     st.session_state.heterodata,
                                     protein_id,
                                     st.session_state.predictions_df,
+                                    limit=max_node_count,
+                                    include_second_degree=False
                                 )
+                                # With second degree
+                                html_path_2nd, edges_2nd = visualize_protein_subgraph(
+                                    st.session_state.heterodata,
+                                    protein_id,
+                                    st.session_state.predictions_df,
+                                    limit=max_node_count,
+                                    include_second_degree=True
+                                )
+                                # Store both visualizations in session state
+                                st.session_state.protein_visualizations[protein_id]['first_degree'] = {
+                                    'path': html_path_1st,
+                                    'edges': edges_1st
+                                }
+                                st.session_state.protein_visualizations[protein_id]['second_degree'] = {
+                                    'path': html_path_2nd,
+                                    'edges': edges_2nd
                                 }
                                 st.rerun()
+                        # If visualization exists, show the toggle and display appropriate version
                         if viz_exists:
+                            with col2:
+                                include_second_degree = st.checkbox(
+                                    "Include second-degree edges",
+                                    value=False,
+                                    key=f"second_degree_{protein_id}",
+                                    help="Show connections between neighbor nodes"
+                                )
+                            # Get the appropriate visualization based on checkbox
+                            viz_type = 'second_degree' if include_second_degree else 'first_degree'
+                            viz_info = st.session_state.protein_visualizations[protein_id][viz_type]
                             kg_viz_button_columns = st.columns([1, 1, 1])
                             with kg_viz_button_columns[0]:
+                                # Format edges for download
+                                formatted_edges = {}
+                                for edge_type, edges in viz_info['edges'].items():
+                                    edge_type_str = f"{edge_type[0]}_{edge_type[1]}_{edge_type[2]}"
+                                    formatted_edges[edge_type_str] = [
+                                        {"source": edge[0][0], "target": edge[0][1], "probability": edge[1]}
+                                        for edge in edges
+                                    ]
                                 st.download_button(
                                     label='Download Visualized Edges',
                                     data=json.dumps(formatted_edges, indent=2),
+                                    file_name=f'{protein_id}_visualized_edges{"_with_2nd_degree" if include_second_degree else ""}.json',
                                     mime='application/json'
                                 )
                             with kg_viz_button_columns[1]:
                                 if st.button("Regenerate Visualization", key=f"regenerate_{protein_id}"):
+                                    # Clean up old files
+                                    if protein_id in st.session_state.protein_visualizations:
+                                        for viz_type in ['first_degree', 'second_degree']:
+                                            if viz_type in st.session_state.protein_visualizations[protein_id]:
+                                                try:
+                                                    old_path = st.session_state.protein_visualizations[protein_id][viz_type]['path']
+                                                    os.unlink(old_path)
+                                                except OSError:
+                                                    pass
+                                        # Remove from session state
+                                        del st.session_state.protein_visualizations[protein_id]
                                     st.rerun()
+                            # Display the appropriate visualization
                             with open(viz_info['path'], 'r', encoding='utf-8') as f:
                                 html_content = f.read()
                             st.components.v1.html(html_content, height=1200)
             else:
                 st.warning("Knowledge graph visualization is only available when 10 or fewer proteins are selected.")
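
The caching scheme in this file can be summarized without Streamlit. The sketch below uses a plain dictionary in place of `st.session_state.protein_visualizations`. The helper names `has_both_views`, `select_view`, and `download_name` are hypothetical and do not appear in the commit, and the protein ID and paths are illustrative values only.

```python
def has_both_views(cache, protein_id):
    """Mirror the viz_exists check: both renderings must be cached."""
    entry = cache.get(protein_id, {})
    return 'first_degree' in entry and 'second_degree' in entry


def select_view(cache, protein_id, include_second_degree):
    """Pick the cached rendering that matches the checkbox state."""
    viz_type = 'second_degree' if include_second_degree else 'first_degree'
    return cache[protein_id][viz_type]


def download_name(protein_id, include_second_degree):
    """Build the JSON file name used by the download button."""
    suffix = "_with_2nd_degree" if include_second_degree else ""
    return f"{protein_id}_visualized_edges{suffix}.json"


# Stand-in for st.session_state.protein_visualizations (hypothetical values).
cache = {
    "P12345": {
        "first_degree": {
            "path": "temp_viz/P12345_graph_1st_degree.html",
            "edges": {},
        },
        "second_degree": {
            "path": "temp_viz/P12345_graph_with_2nd_degree.html",
            "edges": {},
        },
    }
}

if has_both_views(cache, "P12345"):
    view = select_view(cache, "P12345", include_second_degree=True)
    print(view["path"], download_name("P12345", True))
```

Because both renderings are generated up front, toggling the checkbox only switches which cached HTML file is read; it does not rerun `visualize_protein_subgraph`. The trade-off is that each "Generate Visualization" click does the work twice.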

visualize_kg.py CHANGED

@@ -22,13 +22,20 @@ EDGE_LABEL_TRANSLATION = {
     'Orthology': 'is ortholog to',
     'Pathway': 'takes part in',
     'kegg_path_prot': 'takes part in',
     'protein_domain': 'has',
     'PPI': 'interacts with',
     'HPO': 'is associated with',
     'kegg_dis_prot': 'is related to',
     'Disease': 'is related to',
     'Drug': 'targets',
     'protein_ec': 'catalyzes',
     'Chembl': 'targets',
     ('protein_function', 'GO_term_F'): 'enables',
     ('protein_function', 'GO_term_P'): 'is involved in',
@@ -168,14 +175,96 @@ def _filter_edges(protein_id, protein_edges, prediction_df, limit=10):
     return filtered_edges
-def visualize_protein_subgraph(data, protein_id, prediction_df, limit=10):
     with gzip.open('data/name_info.json.gz', 'rt', encoding='utf-8') as file:
         name_info = json.load(file)
     protein_edges = _gather_protein_edges(data, protein_id)
-    visualized_edges = _filter_edges(protein_id, protein_edges, prediction_df, limit)
     print(f'Edges to be visualized: {visualized_edges}')
     net = Network(height="600px", width="100%", directed=True, notebook=False)
@@ -259,7 +348,7 @@ def visualize_protein_subgraph(data, protein_id, prediction_df, limit=10):
     for edge_type, edges in visualized_edges.items():
         source_type, relation_type, target_type = edge_type
-        if relation_type == 'protein_function':
             relation_type = EDGE_LABEL_TRANSLATION[(relation_type, target_type)]
         else:
             relation_type = EDGE_LABEL_TRANSLATION[relation_type]
@@ -449,7 +538,8 @@ def visualize_protein_subgraph(data, protein_id, prediction_df, limit=10):
     # Save graph to a protein-specific file in a temporary directory
     os.makedirs('temp_viz', exist_ok=True)
-    file_path = os.path.join('temp_viz', f'{protein_id}_graph.html')
     net.save_graph(file_path)

     'Orthology': 'is ortholog to',
     'Pathway': 'takes part in',
     'kegg_path_prot': 'takes part in',
+    ('domain_function', 'GO_term_F'): 'enables',
+    ('domain_function', 'GO_term_P'): 'is involved in',
+    ('domain_function', 'GO_term_C'): 'localizes to',
+    'function_function': 'ontological relationship',
     'protein_domain': 'has',
     'PPI': 'interacts with',
     'HPO': 'is associated with',
     'kegg_dis_prot': 'is related to',
     'Disease': 'is related to',
     'Drug': 'targets',
+    'kegg_dis_path': 'modulates',
     'protein_ec': 'catalyzes',
+    'hpodis': 'is associated with',
+    'kegg_dis_drug': 'treats',
     'Chembl': 'targets',
     ('protein_function', 'GO_term_F'): 'enables',
     ('protein_function', 'GO_term_P'): 'is involved in',
     return filtered_edges
+def _gather_neighbor_edges(data, node_id, node_type, exclude_node_id):
+    """Gather edges for a neighbor node, excluding edges back to the original query protein"""
+    node_idx = data[node_type]['id_mapping'][node_id]
+    reverse_id_mapping = {}
+    for ntype in data.node_types:
+        reverse_id_mapping[ntype] = {v:k for k, v in data[ntype]['id_mapping'].items()}
+    neighbor_edges = {}
+    for edge_type in data.edge_types:
+        if 'rev' not in edge_type[1]:
+            if edge_type not in neighbor_edges:
+                neighbor_edges[edge_type] = []
+            if edge_type[0] == node_type:
+                # Get edges where neighbor is source
+                edges = data[edge_type].edge_index[:, data[edge_type].edge_index[0] == node_idx]
+                edges = edges.T.tolist()
+                # Filter out edges going back to the query protein
+                edges = [edge for edge in edges if reverse_id_mapping[edge_type[2]][edge[1]] != exclude_node_id]
+                neighbor_edges[edge_type].extend(edges)
+            elif edge_type[2] == node_type:
+                # Get edges where neighbor is target
+                edges = data[edge_type].edge_index[:, data[edge_type].edge_index[1] == node_idx]
+                edges = edges.T.tolist()
+                # Filter out edges coming from the query protein
+                edges = [edge for edge in edges if reverse_id_mapping[edge_type[0]][edge[0]] != exclude_node_id]
+                neighbor_edges[edge_type].extend(edges)
+    # Map indices back to IDs
+    for edge_type in neighbor_edges.keys():
+        if neighbor_edges[edge_type]:
+            mapped_edges = set()
+            for edge in neighbor_edges[edge_type]:
+                source_type, _, target_type = edge_type
+                source_id = reverse_id_mapping[source_type][edge[0]]
+                target_id = reverse_id_mapping[target_type][edge[1]]
+                mapped_edges.add((source_id, target_id))
+            neighbor_edges[edge_type] = mapped_edges
+    return neighbor_edges
+def visualize_protein_subgraph(data, protein_id, prediction_df, limit=10, second_degree_limit=3, include_second_degree=False):
     with gzip.open('data/name_info.json.gz', 'rt', encoding='utf-8') as file:
         name_info = json.load(file)
+    # Get the first-degree edges and filter them
     protein_edges = _gather_protein_edges(data, protein_id)
+    first_degree_edges = _filter_edges(protein_id, protein_edges, prediction_df, limit)
+    # Initialize all_edges with first degree edges
+    all_edges = first_degree_edges.copy()
+    if include_second_degree:
+        # Collect neighbor nodes from first-degree edges
+        neighbor_nodes = set()
+        for edge_type, edges in first_degree_edges.items():
+            source_type, _, target_type = edge_type
+            for edge_info in edges:
+                edge = edge_info[0]
+                source, target = edge
+                if source != protein_id:
+                    neighbor_nodes.add((source, source_type))
+                if target != protein_id:
+                    neighbor_nodes.add((target, target_type))
+        # Gather and filter second-degree edges
+        second_degree_edges = {}
+        for neighbor_id, neighbor_type in neighbor_nodes:
+            neighbor_edges = _gather_neighbor_edges(data, neighbor_id, neighbor_type, protein_id)
+            filtered_neighbor_edges = _filter_edges(neighbor_id, neighbor_edges, prediction_df, second_degree_limit)
+            # Merge filtered neighbor edges into second_degree_edges
+            for edge_type, edges in filtered_neighbor_edges.items():
+                if edge_type not in second_degree_edges:
+                    second_degree_edges[edge_type] = []
+                second_degree_edges[edge_type].extend(edges)
+        # Merge first and second degree edges
+        for edge_type, edges in second_degree_edges.items():
+            if edge_type in all_edges:
+                all_edges[edge_type].extend(edges)
+            else:
+                all_edges[edge_type] = edges
+    # Update visualized_edges with all edges
+    visualized_edges = all_edges
     print(f'Edges to be visualized: {visualized_edges}')
     net = Network(height="600px", width="100%", directed=True, notebook=False)
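
The filtering logic of `_gather_neighbor_edges` can be tried on toy data without PyTorch. The sketch below is an approximation, not the committed code: it replaces the `edge_index` tensors with plain lists of `(source_idx, target_idx)` pairs, always returns sets (the commit leaves empty edge types as empty lists), and the function name, node type names, relation names, and IDs are all hypothetical.

```python
def gather_neighbor_edges(edge_lists, id_mapping, node_id, node_type, exclude_node_id):
    """Collect edges touching one neighbor, dropping those that lead back
    to the query protein.

    edge_lists: {(source_type, relation, target_type): [(source_idx, target_idx), ...]}
    id_mapping: {node_type: {node_id: node_idx}}
    """
    node_idx = id_mapping[node_type][node_id]
    # Invert every mapping once so indices can be turned back into IDs.
    index_to_id = {
        ntype: {idx: nid for nid, idx in mapping.items()}
        for ntype, mapping in id_mapping.items()
    }

    neighbor_edges = {}
    for edge_type, pairs in edge_lists.items():
        source_type, relation, target_type = edge_type
        if 'rev' in relation:
            continue  # reverse relations are skipped, as in the commit
        kept = set()
        if source_type == node_type:
            # Neighbor is the source: drop edges whose target is the query protein.
            for s, t in pairs:
                if s == node_idx and index_to_id[target_type][t] != exclude_node_id:
                    kept.add((index_to_id[source_type][s], index_to_id[target_type][t]))
        elif target_type == node_type:
            # Neighbor is the target: drop edges whose source is the query protein.
            for s, t in pairs:
                if t == node_idx and index_to_id[source_type][s] != exclude_node_id:
                    kept.add((index_to_id[source_type][s], index_to_id[target_type][t]))
        neighbor_edges[edge_type] = kept
    return neighbor_edges


# Toy graph (hypothetical IDs): P1 is the query protein, P2 its neighbor.
id_mapping = {"Protein": {"P1": 0, "P2": 1, "P3": 2}, "Domain": {"D1": 0}}
edge_lists = {
    ("Protein", "PPI", "Protein"): [(0, 1), (1, 0), (1, 2)],
    ("Protein", "protein_domain", "Domain"): [(1, 0), (0, 0)],
    ("Domain", "rev_protein_domain", "Protein"): [(0, 1)],
}
result = gather_neighbor_edges(edge_lists, id_mapping, "P2", "Protein", "P1")
print(result)
```

As in the commit, the `if`/`elif` structure means that for an edge type whose source and target share the neighbor's node type, only edges where the neighbor is the source are collected.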
     for edge_type, edges in visualized_edges.items():
         source_type, relation_type, target_type = edge_type
+        if relation_type in ['protein_function', 'domain_function']:
             relation_type = EDGE_LABEL_TRANSLATION[(relation_type, target_type)]
         else:
             relation_type = EDGE_LABEL_TRANSLATION[relation_type]
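
The label lookup mixes two key shapes: plain relation names, and `(relation, target_type)` tuples for relations whose wording depends on the GO aspect. A minimal sketch of that dispatch follows; `EDGE_LABELS` is a four-entry excerpt of `EDGE_LABEL_TRANSLATION`, while the `edge_label` helper and the source node type names are hypothetical.

```python
# Excerpt of the translation table; entries copied from the diff above.
EDGE_LABELS = {
    'PPI': 'interacts with',
    'function_function': 'ontological relationship',
    ('protein_function', 'GO_term_F'): 'enables',
    ('domain_function', 'GO_term_C'): 'localizes to',
}


def edge_label(edge_type, labels=EDGE_LABELS):
    """Translate a (source_type, relation, target_type) triple to a display label."""
    _source_type, relation_type, target_type = edge_type
    if relation_type in ('protein_function', 'domain_function'):
        # GO relations are keyed by relation and target type together.
        return labels[(relation_type, target_type)]
    return labels[relation_type]


print(edge_label(('Domain', 'domain_function', 'GO_term_C')))
print(edge_label(('Protein', 'PPI', 'Protein')))
```

Any relation that can now appear among second-degree edges needs an entry in the table; a missing key raises `KeyError` at render time, which is presumably why this commit adds `kegg_dis_path`, `hpodis`, `kegg_dis_drug`, and `function_function`.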
     # Save graph to a protein-specific file in a temporary directory
     os.makedirs('temp_viz', exist_ok=True)
+    suffix = "_with_2nd_degree" if include_second_degree else "_1st_degree"
+    file_path = os.path.join('temp_viz', f'{protein_id}_graph{suffix}.html')
     net.save_graph(file_path)
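
The neighbor collection and merge steps inside `visualize_protein_subgraph` can be sketched on their own. One detail worth noting: `first_degree_edges.copy()` is a shallow copy, so the later `extend` calls also grow the lists held by `first_degree_edges`. That appears harmless in the code shown, but the sketch below copies each list to keep the inputs untouched. The helper names, node type names, IDs, and probability values are hypothetical; the `((source, target), probability)` edge shape follows how the app unpacks edges for download.

```python
def collect_neighbors(first_degree_edges, protein_id):
    """Return (node_id, node_type) for every endpoint other than the query protein."""
    neighbors = set()
    for (source_type, _relation, target_type), edges in first_degree_edges.items():
        for (source, target), _probability in edges:
            if source != protein_id:
                neighbors.add((source, source_type))
            if target != protein_id:
                neighbors.add((target, target_type))
    return neighbors


def merge_edges(first_degree_edges, second_degree_edges):
    """Combine both edge dictionaries without mutating either input."""
    merged = {etype: list(edges) for etype, edges in first_degree_edges.items()}
    for etype, edges in second_degree_edges.items():
        merged.setdefault(etype, []).extend(edges)
    return merged


# Toy inputs (hypothetical): P1 is the query protein.
first = {
    ("Protein", "PPI", "Protein"): [(("P1", "P2"), 0.9)],
    ("Protein", "protein_domain", "Domain"): [(("P1", "D1"), 0.8)],
}
second = {
    ("Protein", "PPI", "Protein"): [(("P2", "P3"), 0.7)],
    ("Domain", "domain_function", "GO_term_F"): [(("D1", "GO:0005515"), 0.6)],
}

neighbors = collect_neighbors(first, "P1")
merged = merge_edges(first, second)
print(sorted(neighbors))
print(merged)
```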