# genbio-ai/proteinMoE-16b-Petal

**genbio-ai/proteinMoE-16b-Petal** is a fine-tuned version of **genbio-ai/proteinMoE-16b**, specifically designed for protein structure prediction. It takes amino acid sequences as input and predicts tokens that can be decoded into 3D structures by **genbio-ai/petal-decoder**, and it surpasses existing state-of-the-art models such as **ESM3-open** on structure prediction tasks.

## Model Architecture Details

This model retains the architecture of AIDO.Protein 16B: a transformer encoder-only model in which the dense MLP layers are replaced by sparse Mixture of Experts (MoE) layers. Each token activates 2 experts via a top-2 routing mechanism. A visual summary of the architecture is provided below:

<center>(Model architecture diagram)</center>
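
For intuition, here is a minimal, self-contained sketch of a top-2 MoE feed-forward block. It is illustrative only and not this model's implementation: the layer sizes and names (`Top2MoE`, `hidden_dim`, `ffn_dim`, `num_experts`) are placeholders, not values taken from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    """Sketch of a sparse MoE feed-forward block with top-2 routing."""

    def __init__(self, hidden_dim=64, ffn_dim=128, num_experts=8):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (batch, seq_len, hidden_dim)
        scores = self.router(x)                         # (batch, seq_len, num_experts)
        top2_scores, top2_idx = scores.topk(2, dim=-1)  # each token keeps only its 2 best experts
        weights = F.softmax(top2_scores, dim=-1)        # mixing weights over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(2):                           # accumulate the two selected experts per token
            for e, expert in enumerate(self.experts):
                routed = (top2_idx[..., slot] == e).unsqueeze(-1).to(x.dtype)  # tokens that chose expert e
                out = out + routed * weights[..., slot:slot + 1] * expert(x)
        return out


# Example: 2 sequences of 10 tokens with a toy hidden size of 64
print(Top2MoE()(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```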

### Key Differences

The final output linear layer has been adapted to support a new vocabulary size (a minimal sketch follows the list):
- **Input Vocabulary Size**: 44 (amino acids + special tokens)
- **Output Vocabulary Size**: 512 (structure tokens, without special tokens)
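
The change amounts to swapping the output projection while keeping the 44-token input embedding. A minimal sketch, assuming a placeholder hidden size (only the 44 and 512 vocabulary sizes come from this card):

```python
import torch.nn as nn

HIDDEN_DIM = 2048          # placeholder; substitute the backbone's actual hidden size
INPUT_VOCAB_SIZE = 44      # amino acids + special tokens (from this card)
OUTPUT_VOCAB_SIZE = 512    # structure tokens, no special tokens (from this card)

embedding = nn.Embedding(INPUT_VOCAB_SIZE, HIDDEN_DIM)  # amino-acid token ids -> hidden states
lm_head = nn.Linear(HIDDEN_DIM, OUTPUT_VOCAB_SIZE)      # hidden states -> structure-token logits
```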

### Architecture Parameters

| Component                 | Value |
|---------------------------|-------|
| Number of Attention Heads | 36    |
| Output Vocabulary Size    | 512   |
| Context Length            | 1024  |

## Training Details

The fine-tuning process used **0.4 trillion tokens**, drawn from the AlphaFold database (**170M samples**) and the PDB database (**0.4M samples**), making the model highly specialized for structure prediction. Training took around 20 days on 64 A100 GPUs.

- **Tokens Trained**: 4T tokens
- **Training steps**: 200k steps

## Tokenization

The input should be a single-chain amino acid sequence.

- **Input Tokenization**: Sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using **genbio-ai/petal-decoder** (see the shape sketch after this list).
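
As a rough illustration of the shapes involved (this is not the actual tokenizer API; the `aa_to_id` mapping and example sequence are stand-ins, and only the `[SEP]` id of 34 and the 512-token structure vocabulary come from this card):

```python
import torch

SEP_ID = 34                    # [SEP] terminator id (from this card)
STRUCT_VOCAB_SIZE = 512        # number of structure tokens (from this card)

sequence = "MKTAYIAKQR"        # a single-chain amino acid sequence
# Stand-in per-residue mapping; the real tokenizer defines the 44-token input vocabulary.
aa_to_id = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

input_ids = torch.tensor([aa_to_id[aa] for aa in sequence] + [SEP_ID])
print(input_ids.shape)         # torch.Size([11]) -> one token per residue, plus [SEP]

# The model emits one structure token per input position; these ids (0..511) are what
# genbio-ai/petal-decoder turns into a 3D structure in PDB format.
dummy_logits = torch.randn(input_ids.shape[0], STRUCT_VOCAB_SIZE)
structure_tokens = dummy_logits.argmax(dim=-1)
print(structure_tokens.shape)  # torch.Size([11])
```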

## Results

<center><img src="StructurePrediction.PNG" alt="Structure Prediction Results" style="width:40%; height:auto;" /></center>

## How to use

### Build any downstream models from this backbone

The usage of this model is the same as **genbio-ai/proteinMoE-16b**. You only need to change the `model.backbone` to `proteinfm_struct_token`.

#### Embedding

```python
from genbio_finetune.tasks import Embed

model = Embed.from_config({"model.backbone": "proteinfm_struct_token"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```

#### Sequence Level Classification

```python
import torch
from genbio_finetune.tasks import SequenceClassification

model = SequenceClassification.from_config({"model.backbone": "proteinfm_struct_token", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Token Level Classification

```python
import torch
from genbio_finetune.tasks import TokenClassification

model = TokenClassification.from_config({"model.backbone": "proteinfm_struct_token", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Regression

```python
from genbio_finetune.tasks import SequenceRegression

model = SequenceRegression.from_config({"model.backbone": "proteinfm_struct_token"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
```

#### Or use our one-liner CLI to finetune or evaluate any of the above!

```bash
gbft fit --model SequenceClassification --model.backbone proteinfm_struct_token --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
gbft test --model SequenceClassification --model.backbone proteinfm_struct_token --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
```

For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)