---
tags:
- biology
- genbio
- model_hub_mixin
- protein
- pytorch_model_hub_mixin
license: other
---

# AIDO.StructureTokenizer

AIDO.StructureTokenizer is a VQ-VAE-based tokenizer designed for protein structure prediction and tokenization. It encodes amino-acid-agnostic backbone structures into discrete tokens and reconstructs the full atomic-level structures, including side chains. This tokenizer facilitates the integration of 3D protein structure data with sequence-based language models, enabling efficient and accurate multimodal protein modeling.

## Model Description

![Model Architecture](./assets/images/architecture.png)

**AIDO.StructureTokenizer** is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:
- Equivariant Encoder (6M): Encodes backbone structures into a latent space that maintains rotational and translational symmetries using the Equiformer architecture.
- Discrete Codebook: Maps continuous latent vectors into 512 discrete structural tokens.
- Invariant Decoder (300M): Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.
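
The codebook step in the list above can be illustrated with a minimal nearest-neighbour quantization sketch. This is not the model's actual implementation: the function names `quantize` and `dequantize`, the latent dimension of 8, and the random codebook are assumptions made for illustration. Only the codebook size of 512 comes from the description above.

```python
import numpy as np


def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each latent vector, the index of its nearest codebook vector."""
    # Squared Euclidean distance between every latent (N, D) and every code (K, D).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)  # shape (N, K)
    return np.argmin(dists, axis=1)


def dequantize(tokens: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the codebook vector for each token index."""
    return codebook[tokens]


rng = np.random.default_rng(0)
# 512 codes as stated in the model description; the dimension 8 is an arbitrary assumption.
codebook = rng.normal(size=(512, 8))
# Hypothetical per-residue latents: three codebook vectors plus a little noise.
latents = codebook[[3, 100, 511]] + 0.01 * rng.normal(size=(3, 8))

tokens = quantize(latents, codebook)         # one discrete token per residue
quantized = dequantize(tokens, codebook)     # vectors a decoder would consume
```

In a VQ-VAE the decoder only sees the looked-up codebook vectors, which is why a structure can be stored and exchanged as a sequence of integer tokens.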

This model strikes a balance between reconstruction fidelity and structural locality, optimizing its suitability for downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.

### Key Features

- Encoding Structures into Tokens (See [genbio-ai/AIDO.StructureEncoder](https://huggingface.co/genbio-ai/AIDO.StructureEncoder))
- Decoding Tokens into Structures (See [genbio-ai/AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder))
- Reconstructing Structures (See [below](#how-to-use))
- Structure Prediction (See [this section](https://huggingface.co/genbio-ai/AIDO.Protein2StructureToken-16B/blob/main/README.md#structure-prediction) in genbio-ai/AIDO.Protein2StructureToken-16B)

## Results

### Reconstructing Structures
![Reconstruction Results](./assets/images/reconstruction.png)
### Homology Detection
![Homology Detection Results](./assets/images/homology_detection.png)
### Structure Prediction
![Structure Prediction Results](./assets/images/structure_prediction.png)

## How to Use
Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator) for more details.

### Setup
Install [Model Generator](https://github.com/genbio-ai/modelgenerator)

### Data preparation

To reproduce the reconstruction results in the paper, we provide a preprocessed CASP15 dataset at [genbio-ai/sample-structure-dataset](https://huggingface.co/datasets/genbio-ai/sample-structure-dataset). It can be downloaded with:
```bash
huggingface-cli download genbio-ai/sample-structure-dataset --repo-type dataset --local-dir ./data/protstruct_sample_data/
```

This dataset is based on the CASP15 dataset, which can be referenced at:
- [CASP15 Prediction Center](https://predictioncenter.org/casp15/)
- [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15)

The downloaded directory includes:
- A `registries` folder containing a CSV file with metadata such as filenames and PDB IDs.
- A `CASP15_merged` folder containing PDB files, where domains are merged in the same way as described in [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15).

To use customized data, you can prepare a dataset with the following structure:
- A folder containing PDB files (supported formats: `cif.gz`, `cif`, `ent.gz`, `pdb`).

Then, you need to prepare a registry file in CSV format using the following command:
```bash
python experiments/AIDO.StructureTokenizer/register_dataset.py \
    --folder_path /path/to/folder_path \
    --format cif.gz \
    --output_file /path/to/output_file.csv
```
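
Before running the registration script, it can help to check which of the supported formats a folder actually contains, since `--format` takes a single value. The sketch below is a hypothetical helper, not part of Model Generator. It assumes the structure files sit directly in the folder (no recursion) and that the format is identified by the file name suffix.

```python
from collections import Counter
from pathlib import Path

# Formats listed in this README as supported for custom datasets.
SUPPORTED_FORMATS = ("cif.gz", "ent.gz", "cif", "pdb")


def count_structure_files(folder_path: str) -> Counter:
    """Count files in `folder_path` per supported format suffix (top level only)."""
    counts = Counter()
    for path in Path(folder_path).iterdir():
        if not path.is_file():
            continue
        for fmt in SUPPORTED_FORMATS:
            if path.name.endswith("." + fmt):
                counts[fmt] += 1
                break  # each file is counted under one format only
    return counts
```

Calling `count_structure_files("/path/to/folder_path")` returns a `Counter` keyed by format. If more than one format shows up, the folder probably needs to be split or converted before registration.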

You need to replace the `folder_path` and the `registry_path` in the following steps accordingly.

### Running Encoding and Decoding Task

If you use the provided CASP15 dataset, you can run the combined encoding and decoding task using the following command:
```bash
CUDA_VISIBLE_DEVICES=0 mgen predict --config=experiments/AIDO.StructureTokenizer/encode_decode.yaml
```

If you use your own dataset, you need to update the `folder_path` and the `registry_path` in the `encode_decode.yaml` configuration file or override them when running the command. Example:
```bash
CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/encode_decode.yaml \
    --data.init_args.config.proteins_datasets_configs.name="your_dataset_name" \
    --data.init_args.config.proteins_datasets_configs.registry_path="your_dataset_registry_path" \
    --data.init_args.config.proteins_datasets_configs.folder_path="your_dataset_folder_path" \
    --trainer.callbacks.dict_kwargs.output_dir="your_output_dir"
```

The input and the output can be summarized as follows:

**Input:**
- The PDB files in the dataset folder.
- The registry file in CSV format indicating the metadata of the dataset.

**Output:**
- The decoded structures and their corresponding original structures are saved in the output directory specified in the configuration file. By default, this is `logs/protstruct_model/`.
- The decoded structures end with `output.pdb`.
- The original structures end with `input.pdb`.
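
To work with the results programmatically, the decoded and original files can be matched up by name. The following is a minimal sketch under an assumption not stated in this README: each pair shares the same directory and the same file name prefix before `input.pdb` and `output.pdb`. The function name `pair_structures` is hypothetical.

```python
from pathlib import Path


def pair_structures(output_dir: str) -> list[tuple[Path, Path]]:
    """Return (original, decoded) path pairs found under `output_dir`."""
    pairs = []
    for decoded in sorted(Path(output_dir).rglob("*output.pdb")):
        # Assumption: the original file has the same prefix and lives next to the decoded one.
        prefix = decoded.name[: -len("output.pdb")]
        original = decoded.with_name(prefix + "input.pdb")
        if original.exists():
            pairs.append((original, decoded))
    return pairs
```

For example, `pair_structures("logs/protstruct_model/")` would list the pairs written to the default output directory. Decoded files without a matching original are skipped.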

**Notes:**
- Decoding the structures can take a long time, even on a GPU.
- Currently, this function only supports single GPU inference due to the file saving mechanism. We plan to support multi-GPU inference in the future.
- The reconstructed structures are aligned to the original structures using the Kabsch algorithm. This makes it easier to visualize and compare the structures.
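
For readers unfamiliar with the Kabsch algorithm mentioned in the notes, here is a minimal NumPy sketch of rigid superposition followed by an RMSD computation. It illustrates the general technique only and is not the code used by Model Generator. The function names `kabsch_align` and `rmsd` and the synthetic coordinates are assumptions for the example.

```python
import numpy as np


def kabsch_align(mobile: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Rigidly superimpose `mobile` (N, 3) onto `target` (N, 3) and return the moved copy."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    p = mobile - mobile_center
    q = target - target_center
    # Cross-covariance matrix of the centered point sets and its SVD.
    h = p.T @ q
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    # Rotate the centered mobile points, then move them onto the target centroid.
    return p @ rotation.T + target_center


def rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square deviation between two (N, 3) coordinate arrays."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))


# Synthetic check: rotate and translate a random point cloud, then align it back.
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 3))
angle = 0.7
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
mobile = target @ rot_z.T + np.array([5.0, -2.0, 1.0])

aligned = kabsch_align(mobile, target)
error = rmsd(aligned, target)  # close to zero for an exact rigid transform
```

With real structures, the two arrays would hold matching atoms (for example the C-alpha coordinates) from an `input.pdb` and `output.pdb` pair, and the remaining RMSD after alignment reflects the reconstruction error rather than a difference in pose.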

### Visualizing the Reconstructed Structures

We use VS Code with the [Protein Viewer Extension](https://marketplace.visualstudio.com/items?itemName=ArianJamasb.protein-viewer) to visualize the protein structures. It is a beginner-friendly tool for VS Code users. You can also use your preferred protein structure viewer (e.g., PyMOL or ChimeraX), but here we focus on this extension.

After running the [Running Encoding and Decoding Task](#running-encoding-and-decoding-task) step, you can find the decoded structures and their corresponding original structures in the output directory. You can visualize them as follows:
- Find the desired `output.pdb` and `input.pdb` pair in the side panel. Select both files while holding the `Ctrl` key (for Mac users, hold the `Cmd` key). ![Select Files](./assets/images/structure_tokenizer/select_files.png)
- Right-click on the selected files and choose "Launch Protein Viewer". ![Launch Protein Viewer from File(s)](./assets/images/structure_tokenizer/launch_protein_viewer.png)
- A new tab will open with the protein structures displayed. You can interact with the structures using the Protein Viewer extension. Because we have aligned the reconstructed structures to the original structures using the Kabsch algorithm, the displayed structures should look like this, with each file shown in a different color. ![Visualize Reconstruction](./assets/images/structure_tokenizer/visualize_reconstruction.png)

## Citation
Please cite AIDO.StructureTokenizer using the following BibTeX entry:
```
@inproceedings{zhang_balancing_2024,
	title = {Balancing Locality and Reconstruction in Protein Structure Tokenizer},
	url = {https://www.biorxiv.org/content/10.1101/2024.12.02.626366v2},
	doi = {10.1101/2024.12.02.626366},
	publisher = {bioRxiv},
	author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
	year = {2024},
	booktitle = {NeurIPS 2024 Workshop on Machine Learning in Structural Biology (MLSB)},
}
```