---
license: apache-2.0
tags:
- chemistry
- foundation models
- AI4Science
- materials
- molecules
- safetensors
- pytorch
- vqgan
---

# 3D Electron Density Grids-based VQGAN (3DGrid-VQGAN)

This repository provides PyTorch source code associated with our publication, "A Foundation Model for Simulation-Grade Molecular Electron Densities".

**GitHub link:** [IBM/materials/models/3dgrid_vqgan](https://github.com/IBM/materials/tree/main/models/3dgrid_vqgan)

For more information, contact [email protected] or [email protected].

![3dgrid-vqgan](3dgrid_vqgan/images/3dgridvqgan_architecture.png)

## Introduction

We present 3DGrid-VQGAN, an encoder-decoder chemical foundation model for representing 3D electron density grids, pre-trained on a dataset of approximately 855K molecules from the PubChem database. 3DGrid-VQGAN efficiently encodes high-dimensional grid data into compact latent representations, enabling downstream tasks such as molecular property prediction with enhanced accuracy. This approach could significantly reduce reliance on computationally intensive quantum chemical simulations by offering simulation-grade data derived directly from learned representations.
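As a reference for the vector-quantization step at the core of any VQGAN, here is a minimal sketch of nearest-codebook lookup in PyTorch. The tensor shapes and codebook size are illustrative assumptions, not this model's actual configuration.

```python
import torch

# Illustrative VQ step: a VQGAN encoder maps a 3D density grid to a latent
# volume, and each latent vector is snapped to its nearest codebook entry.
codebook = torch.randn(512, 64)      # 512 codes, 64-dim each (assumed sizes)
latents = torch.randn(8, 8, 8, 64)   # encoder output for one grid (assumed shape)

flat = latents.reshape(-1, 64)       # one row per latent vector
dists = torch.cdist(flat, codebook)  # pairwise L2 distances to all codes
indices = dists.argmin(dim=1)        # index of the nearest code per vector
quantized = codebook[indices].reshape(latents.shape)
```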

## Table of Contents

1. [Getting Started](#getting-started)
   1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
   2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Pretraining](#pretraining)
3. [Finetuning](#finetuning)
4. [Feature Extraction](#feature-extraction)
5. [Citations](#citations)

## Getting Started

**This code and environment have been tested on NVIDIA V100 and A100 GPUs.**

### Pretrained Models and Training Logs

Add the 3DGrid-VQGAN pre-trained weights (`VQGAN_43.pt`) to the `data/checkpoints/pretrained` directory. The directory structure should look like the following:

```
data/
└── checkpoints/
    └── pretrained/
        └── VQGAN_43.pt
```
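Once the checkpoint is in place, it can be inspected with standard PyTorch tooling. This is a minimal sketch assuming a conventional `torch.save` checkpoint; the actual contents and key layout of `VQGAN_43.pt` may differ.

```python
import torch

# Load the pre-trained checkpoint on CPU for inspection (a sketch; the
# checkpoint's actual structure is an assumption).
ckpt = torch.load("data/checkpoints/pretrained/VQGAN_43.pt", map_location="cpu")

# Checkpoints saved as dicts typically expose their top-level keys:
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```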

### Replicating Conda Environment

Follow these steps to replicate our Conda environment and install the necessary libraries:

#### Create and Activate Conda Environment

```
conda create --name 3dvqgan-env python=3.10
conda activate 3dvqgan-env
```

#### Install Packages with Conda

```
conda install pytorch=2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge mpi4py=4.0.0 openmpi=5.0.5
```

#### Install Packages with Pip

```
pip install -r requirements.txt
```
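After installation, a quick sanity check (a suggestion, not part of the repository's instructions) confirms that the expected PyTorch build and CUDA runtime are visible:

```python
import torch

print(torch.__version__)          # expect 2.1.0
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # should be True on a V100/A100 node
```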

## Pretraining

3DGrid-VQGAN is pre-trained on approximately 855K 3D electron density grids from PubChem, totaling approximately 7 TB of data.

The pretraining code provides examples of data processing and model training on a smaller dataset.

To pre-train the 3DGrid-VQGAN model, run:

```
bash training/run_mpi_training.sh
```
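The launch script runs training through MPI (hence the `mpi4py`/Open MPI dependencies above). As a rough sketch of how a script typically discovers its rank under `mpirun`, assuming nothing about this repository's actual training code:

```python
from mpi4py import MPI

# Each process launched by mpirun gets a rank; rank 0 conventionally handles
# logging and checkpointing while all ranks train on their shard of the data.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()
print(f"rank {rank} of {world_size}")
```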

## Finetuning

The finetuning datasets and environment can be found in the [finetune](finetune/) directory. After setting up the environment, you can run a finetuning task with:

```
bash finetune/run_finetune_qm9_alpha.sh
```

Finetuning logs and checkpoints are written to directories named `data/checkpoints/finetuned/<dataset_name>/<measure_name>`, as illustrated in the sketch below.
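For the QM9 alpha example above, outputs would land under `data/checkpoints/finetuned/qm9/alpha`. A small sketch of locating them (the `.pt` file pattern is an assumption):

```python
from pathlib import Path

# For the example above: dataset_name = "qm9", measure_name = "alpha".
ckpt_dir = Path("data/checkpoints/finetuned") / "qm9" / "alpha"

# List whatever checkpoints the finetuning run produced.
for path in sorted(ckpt_dir.glob("*.pt")):
    print(path)
```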

## Feature Extraction

To extract embeddings from the 3DGrid-VQGAN model, run:

```
bash inference/run_extract_embeddings.sh
```
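The extracted embeddings can then feed a lightweight downstream model. Here is a minimal sketch assuming the script writes embeddings and targets as NumPy arrays; the file names and format are assumptions, not the script's documented output.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical file names; the extraction script's actual output may differ.
X = np.load("embeddings.npy")  # (n_molecules, latent_dim)
y = np.load("targets.npy")     # (n_molecules,) property values

# Fit a simple ridge regressor on the frozen embeddings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 on held-out molecules:", model.score(X_test, y_test))
```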

## Citations