Metadata-Version: 2.1
Name: mhg-gnn
Version: 0.0
Summary: Package for mhg-gnn
Author: team
License: TBD
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/markdown
Requires-Dist: networkx>=2.8
Requires-Dist: numpy<2.0.0,>=1.23.5
Requires-Dist: pandas>=1.5.3
Requires-Dist: rdkit-pypi<2023.9.6,>=2022.9.4
Requires-Dist: torch>=2.0.0
Requires-Dist: torchinfo>=1.8.0
Requires-Dist: torch-geometric>=2.3.1

# mhg-gnn

This repository provides PyTorch source code associated with our publication, "MHG-GNN: Combination of Molecular Hypergraph Grammar with Graph Neural Network".

**Paper:** [Arxiv Link](https://arxiv.org/pdf/2309.16374)

For more information contact: [email protected]

![mhg-gnn](images/mhg_example1.png)

## Introduction

We present MHG-GNN, an autoencoder architecture
whose encoder is based on a graph neural network (GNN) and whose decoder is based on a sequential model with molecular hypergraph grammar (MHG).
Since the encoder is a GNN variant, MHG-GNN can accept any molecule as input and
demonstrates high predictive performance on molecular graph data.
In addition, the decoder inherits the theoretical guarantee of MHG that it always generates a structurally valid molecule as output.

## Table of Contents

1. [Getting Started](#getting-started)
    1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
    2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Feature Extraction](#feature-extraction)

## Getting Started

**This code and environment have been tested on Intel E5-2667 CPUs at 3.30GHz and NVIDIA A100 Tensor Core GPUs.**

### Pretrained Models and Training Logs

We provide checkpoints of the MHG-GNN model pre-trained on a dataset of ~1.34M molecules curated from PubChem. Model weights: [HuggingFace Link]() (link to be added later).

Add the MHG-GNN pre-trained weights file (`.pt`) to the `models/` directory as needed.

### Replicating Conda Environment

Follow these steps to replicate our Conda environment and install the necessary libraries:

```
conda create --name mhg-gnn-env python=3.8.18
conda activate mhg-gnn-env
```

#### Install Packages with Conda

```
conda install -c conda-forge networkx=2.8
conda install numpy=1.23.5
# conda install -c conda-forge rdkit=2022.9.4
conda install pytorch=2.0.0 torchvision torchaudio -c pytorch
conda install -c conda-forge torchinfo=1.8.0
conda install pyg -c pyg
```

#### Install Packages with pip
```
pip install rdkit torch-nl==0.3 torch-scatter torch-sparse
```
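
After installation, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library's `importlib.metadata`; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of this repository. The list assumes the distribution names used in the install commands above (for example `rdkit` rather than `rdkit-pypi`), so adjust it to match how you installed each package.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical list for illustration: distribution names as used in the
# install commands above. Adjust to match your own environment.
REQUIRED = [
    "networkx",
    "numpy",
    "pandas",
    "rdkit",
    "torch",
    "torchinfo",
    "torch-geometric",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed under this name.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name:16s} {ver or 'NOT INSTALLED'}")
```

A package installed through Conda may be registered under a different distribution name than its pip counterpart, so a `NOT INSTALLED` line is a prompt to check the name, not proof that the import will fail.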

## Feature Extraction

The example notebook [mhg-gnn_encoder_decoder_example.ipynb](notebooks/mhg-gnn_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks.

To load mhg-gnn, you can simply use:

```python
import torch
import load

model = load.load()
```

To encode SMILES into embeddings, you can use:

```python
with torch.no_grad():
    repr = model.encode(["CCO", "O=C=O", "OC(=O)c1ccccc1C(=O)O"])
```

To decode embeddings back into SMILES strings, you can use:

```python
orig = model.decode(repr)
```
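
The encode and decode calls can be combined into a quick round-trip check. The sketch below is a hedged illustration: `roundtrip` is a hypothetical helper, not part of this repository, and it assumes that `model.decode` returns one SMILES string per input, in the same order as the inputs passed to `model.encode`.

```python
def roundtrip(model, smiles_list):
    """Encode SMILES, decode the embeddings, and pair each input with its output.

    `model` is any object exposing encode() and decode() as used above.
    """
    embeddings = model.encode(smiles_list)
    # Assumption: decode() yields one SMILES string per input, in input order.
    decoded = list(model.decode(embeddings))
    if len(decoded) != len(smiles_list):
        raise ValueError(
            f"expected {len(smiles_list)} decoded strings, got {len(decoded)}"
        )
    return list(zip(smiles_list, decoded))


# Example usage with the pre-trained model (requires this repository's
# `load` module and a checkpoint):
#
#     import torch
#     import load
#
#     model = load.load()
#     with torch.no_grad():
#         for original, reconstructed in roundtrip(model, ["CCO", "O=C=O"]):
#             print(original, "->", reconstructed)
```

Note that the same molecule can be written as many different SMILES strings, so a decoded string that differs from its input does not by itself mean the reconstruction is a different molecule. Comparing canonical forms (for example with RDKit) is a more reliable check.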