JiayouZhangGenbio committed · Commit 530e4c8 · verified · Parent(s): aa5a3b1

Update README.md

Files changed (1): README.md (+65 -8)
README.md CHANGED
# genbio-ai/proteinMoE-16b-Petal

**genbio-ai/proteinMoE-16b-Petal** is a fine-tuned version of **genbio-ai/proteinMoE-16b**, specifically designed for protein structure prediction. The model takes amino acid sequences as input and predicts tokens that can be decoded into 3D structures by **genbio-ai/petal-decoder**. It surpasses existing state-of-the-art models such as **ESM3-open** on structure prediction tasks.

## Model Architecture Details

This model retains the architecture of AIDO.Protein 16B: an encoder-only transformer in which the dense MLP layers are replaced by sparse Mixture of Experts (MoE) layers, with each token activating 2 experts via a top-2 routing mechanism. A visual summary of the architecture is provided below:

<center><em>[Architecture overview figure]</em></center>

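As an illustration of the top-2 routing described above, here is a generic sparse-MoE sketch in PyTorch. It is illustrative only: the expert count, hidden sizes, and activation are placeholders and do not reflect the model's actual implementation. Only the two selected experts run for each token, which is what keeps the activated parameter count well below the 16B total.

```python
# Generic top-2 MoE routing sketch (placeholder sizes; not the model's actual code).
import torch
import torch.nn as nn


class Top2MoELayer(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # per-token routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [n_tokens, d_model]
        weights, chosen = self.router(x).softmax(dim=-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the 2 picks
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e  # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


layer = Top2MoELayer()
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```
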
### Key Differences
The final output linear layer has been adapted to support a new vocabulary size:
- **Input Vocabulary Size**: 44 (amino acids + special tokens)
- **Output Vocabulary Size**: 512 (structure tokens, without special tokens)

### Architecture Parameters
| Component                  | Value |
|----------------------------|-------|
| Number of Attention Heads  | 36    |
| Output Vocabulary Size     | 512   |
| Context Length             | 1024  |

## Training Details

Fine-tuning used **0.4 trillion tokens**, drawn from the AlphaFold database (**170M samples**) and the PDB database (**0.4M samples**), making the model highly specialized for structure prediction. Training took around 20 days on 64 A100 GPUs.

- **Tokens Trained**: 4T tokens
- **Training steps**: 200k steps

## Tokenization

Inputs should be single-chain amino acid sequences.

- **Input Tokenization**: Sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into a 3D structure in PDB format using **genbio-ai/petal-decoder**.

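Since each input token maps to one of the 512 structure tokens, one way to picture structure-token prediction with the task classes shown below is as per-token classification over the structure vocabulary. The following is a sketch under that assumption, reusing the `genbio_finetune` pattern from the usage section; the dedicated structure-prediction interface and the petal-decoder call may differ.

```python
# Sketch only: frame structure-token prediction as per-residue classification over
# the 512 structure tokens (assumption); decoding to PDB is handled by petal-decoder.
import torch
from genbio_finetune.tasks import TokenClassification

model = TokenClassification.from_config(
    {"model.backbone": "proteinfm_struct_token", "model.n_classes": 512}
).eval()

sequence = "MKTAYIAKQR"  # single-chain amino acid sequence (example input)
collated_batch = model.collate({"sequences": [sequence]})
logits = model(collated_batch)                   # one 512-way distribution per token
structure_tokens = torch.argmax(logits, dim=-1)  # predicted structure-token ids

# structure_tokens can then be decoded into a 3D structure in PDB format with
# genbio-ai/petal-decoder (see that model card for the decoding interface).
print(structure_tokens)
```
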
## Results

<center><img src="StructurePrediction.PNG" alt="Structure Prediction Results" style="width:40%; height:auto;" /></center>

## How to use
### Build any downstream model from this backbone
Usage is the same as for **genbio-ai/proteinMoE-16b**; you only need to change `model.backbone` to `proteinfm_struct_token`.

#### Embedding
```python
from genbio_finetune.tasks import Embed

# Load the backbone in inference mode
model = Embed.from_config({"model.backbone": "proteinfm_struct_token"}).eval()
# Tokenize and batch the input sequences
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```

#### Sequence Level Classification
```python
import torch
from genbio_finetune.tasks import SequenceClassification

# Binary sequence-level classification head on top of the backbone
model = SequenceClassification.from_config(
    {"model.backbone": "proteinfm_struct_token", "model.n_classes": 2}
).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))  # predicted class per sequence
```

#### Token Level Classification
```python
import torch
from genbio_finetune.tasks import TokenClassification

# Per-token classification head (3 classes in this example)
model = TokenClassification.from_config(
    {"model.backbone": "proteinfm_struct_token", "model.n_classes": 3}
).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))  # predicted class per token
```

#### Regression
```python
from genbio_finetune.tasks import SequenceRegression

# Sequence-level regression head on top of the backbone
model = SequenceRegression.from_config({"model.backbone": "proteinfm_struct_token"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
```

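All four task wrappers above follow the same pattern: `from_config` builds the task model around the chosen backbone, `collate` tokenizes and batches the raw sequences, and a forward call returns the task outputs, so switching tasks only changes the class name and its config keys.
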
#### Or use our one-liner CLI to finetune or evaluate any of the above
```
gbft fit --model SequenceClassification --model.backbone proteinfm_struct_token --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
gbft test --model SequenceClassification --model.backbone proteinfm_struct_token --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
```
For more information, visit [Model Generator](https://github.com/genbio-ai/modelgenerator).