Update README.md
README.md
CHANGED
@@ -7,7 +7,8 @@ tags:
 ---
 # chemfie-gpt-experiment-1

-

 ## Model Details
 - **Model Type**: GPT-2
@@ -19,7 +20,8 @@ On-going training (3/4)
 - Hands-on learning, research and experimentation in molecular generation
 - Baseline for ablation studies and comparisons with more advanced models

-##
 Since this model doesn't use a proper GPT2 format tokenizer, special tokens still need to be set up manually (next experiment will use a proper one ofc):

 ```python
@@ -53,9 +55,19 @@ print("Sample generated molecules:")
 for i, mol in enumerate(sample_molecules, 1):
     print(f"{i}. {mol}")

 ```

-Tokenized SELFIES to SMILES
 ```python
 import selfies as sf

@@ -69,6 +81,147 @@ C(CCCCCCCCO)=CCC=C
 """
 ```


 ## Training Data
 - **Source**: Curated and merged from COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
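
The "Tokenized SELFIES to SMILES" step in the hunks above starts from space-separated tokens such as `[C] [Branch1] [C]`. As a rough, standard-library-only sketch of that pre-processing (it does not call the `selfies` package, and the helper names `join_tokens`, `split_symbols`, and `is_well_bracketed` are hypothetical, not part of the README), the tokens can be joined into a plain SELFIES string and split back like this:

```python
import re
from typing import List

# Matches one bracketed SELFIES symbol such as "[C]" or "[=Branch1]".
SYMBOL = re.compile(r"\[[^\[\]]+\]")


def join_tokens(tokenized: str) -> str:
    """Drop the spaces between tokens to get a plain SELFIES string."""
    return tokenized.replace(" ", "")


def split_symbols(selfies_str: str) -> List[str]:
    """Split a plain SELFIES string back into its bracketed symbols."""
    return SYMBOL.findall(selfies_str)


def is_well_bracketed(selfies_str: str) -> bool:
    """True if the string consists only of complete [..] symbols."""
    return "".join(split_symbols(selfies_str)) == selfies_str


# Seed string taken from the README's seeded-generation example.
seed = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1]"
plain = join_tokens(seed)
print(plain)
print(split_symbols(plain))
print(is_well_bracketed(plain))
```

A bracket check like this only catches truncated symbols; whether the string decodes to a valid molecule still has to be decided by `sf.decoder` and RDKit, as the README's own code does.
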
@@ -87,12 +240,12 @@ C(CCCCCCCCO)=CCC=C
 ## Training Logs


-| Chunk | Chunk's Training Loss | Chunk's Validation Loss | Status
-| :---: | :-------------------: | :---------------------: |
-| I | 1.346400 | 1.065180 | Done
-| II | 1.123500 | 0.993118 | Done
-| III | 1.058300 | 0.948303 | Done
-| IV |


 ## Evaluation Results
@@ -102,10 +255,17 @@ C(CCCCCCCCO)=CCC=C
 - May generate unrealistic or synthetically inaccessible molecules
 - Performance on complex, branched, and ringed molecules to be evaluated

-## Ethical Considerations
-
--
--

 ## Additional Information
 - Part of experimental chemfie-gpt/T5 project
@@ -7,7 +7,8 @@ tags:
 ---
 # chemfie-gpt-experiment-1

+This model is part of my own hands-on learning and experimentation in molecule generation, to determine which type of model is best suited for SELFIES (GPT2, T5, or a fill-mask model).
+It also serves as a baseline for future ablation and customization studies in model architecture, dataset augmentation(s), and training processes.

 ## Model Details
 - **Model Type**: GPT-2
@@ -19,7 +20,8 @@ On-going training (3/4)
 - Hands-on learning, research and experimentation in molecular generation
 - Baseline for ablation studies and comparisons with more advanced models

+## Usage
+### Direct Use
 Since this model doesn't use a proper GPT2 format tokenizer, special tokens still need to be set up manually (next experiment will use a proper one ofc):

 ```python
@@ -53,9 +55,19 @@ print("Sample generated molecules:")
 for i, mol in enumerate(sample_molecules, 1):
     print(f"{i}. {mol}")

+"""
+....
+2. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C] [C]
+3. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [=C] [Ring1] [N]
+4. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N]
+5. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C]
+
+"""
+
+
 ```

+**Tokenized SELFIES to SMILES:**
 ```python
 import selfies as sf

@@ -69,6 +81,147 @@ C(CCCCCCCCO)=CCC=C
 """
 ```

+#### Generate with Different Temperature(s) and Visualization
+
+```python
+import torch
+import selfies as sf
+from rdkit import Chem
+from rdkit.Chem import Draw
+import matplotlib.pyplot as plt
+
+
+def generate_molecules(temperature, num_molecules=2):
+    inputs = torch.tensor([[tokenizer.bos_token_id]])
+    gen = model.generate(
+        inputs,
+        do_sample=True,
+        max_length=256,
+        temperature=temperature,
+        early_stopping=True,
+        pad_token_id=tokenizer.pad_token_id,
+        num_beams=5,
+        num_return_sequences=num_molecules
+    )
+    return tokenizer.batch_decode(gen, skip_special_tokens=True)
+
+def selfies_to_smiles(selfies_str):
+    selfies_str = selfies_str.replace(' ', '')
+    try:
+        return sf.decoder(selfies_str)
+    except Exception:
+        return None
+
+def visualize_molecules(temperatures):
+    fig, axs = plt.subplots(len(temperatures), 2, figsize=(20, 4*len(temperatures)))  # change the column count (2) if you generate more than 2 samples per temperature
+    fig.suptitle("Generated Molecules at Different Temperatures", fontsize=16)
+
+    for i, temp in enumerate(temperatures):
+        molecules = generate_molecules(temp)
+        for j, mol in enumerate(molecules):
+            smiles = selfies_to_smiles(mol)
+            if smiles:
+                rdkit_mol = Chem.MolFromSmiles(smiles)
+                if rdkit_mol:
+                    img = Draw.MolToImage(rdkit_mol)
+                    axs[i, j].imshow(img)
+                    axs[i, j].axis('off')
+                    axs[i, j].set_title(f"Temp: {temp}", fontsize=10)
+                else:
+                    axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center')
+                    axs[i, j].axis('off')
+            else:
+                axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center')
+                axs[i, j].axis('off')
+
+    plt.tight_layout()
+    plt.show()
+
+# Generate and visualize molecules at different temperatures
+temperatures = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
+visualize_molecules(temperatures)
+
+```
+**Output example:**
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/6Qxd4MgRD_isM9prx-XW3.png)
+
+#### Generate using Starting Sequence with Different Temperature(s) and Visualization
+
+```python
+import torch
+import selfies as sf
+from rdkit import Chem
+from rdkit.Chem import Draw
+import matplotlib.pyplot as plt
+
+
+def generate_molecules(seed, temperature, num_molecules=5):
+    # Tokenize the seed
+    seed_tokens = tokenizer.encode(seed, add_special_tokens=False, return_tensors="pt")
+
+    # Generate from the seed
+    gen = model.generate(
+        seed_tokens,
+        do_sample=True,
+        max_length=256,
+        temperature=temperature,
+        early_stopping=True,
+        pad_token_id=tokenizer.pad_token_id,
+        num_beams=5,
+        num_return_sequences=num_molecules
+    )
+
+    # Decode the generated sequences
+    generated = tokenizer.batch_decode(gen, skip_special_tokens=True)
+
+    # Combine seed with generated sequences
+    return [seed + seq[len(seed):] for seq in generated]
+
+def selfies_to_smiles(selfies_str):
+    selfies_str = selfies_str.replace(' ', '')
+    try:
+        return sf.decoder(selfies_str)
+    except Exception:
+        return None
+
+def visualize_molecules(seed, temperatures):
+    fig, axs = plt.subplots(len(temperatures), 5, figsize=(20, 4*len(temperatures)))
+    fig.suptitle(f"Generated Molecules at Different Temperatures\nSeed: {seed}", fontsize=16)
+
+    for i, temp in enumerate(temperatures):
+        molecules = generate_molecules(seed, temp)
+        for j, mol in enumerate(molecules):
+            smiles = selfies_to_smiles(mol)
+            if smiles:
+                rdkit_mol = Chem.MolFromSmiles(smiles)
+                if rdkit_mol:
+                    img = Draw.MolToImage(rdkit_mol)
+                    axs[i, j].imshow(img)
+                    axs[i, j].axis('off')
+                    axs[i, j].set_title(f"Temp: {temp}", fontsize=10)
+                else:
+                    axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center')
+                    axs[i, j].axis('off')
+            else:
+                axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center')
+                axs[i, j].axis('off')
+
+    plt.tight_layout()
+    plt.show()
+
+# Set the seed and temperatures
+seed = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1]"
+temperatures = [0.5, 1.0, 1.5, 2.0, 2.5]
+
+# Generate and visualize molecules at different temperatures
+visualize_molecules(seed, temperatures)
+
+```
+**Example output:**
+
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/cHamzqHjBj4tNxDPgdZ-g.png)

 ## Training Data
 - **Source**: Curated and merged from COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
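
Both generation snippets added in the hunk above sweep the `temperature` argument of `model.generate`. As a minimal illustration of why that changes the output, the sketch below applies the usual temperature scaling (dividing logits by the temperature before the softmax) to a toy example. The logits are made up for illustration and do not come from this model, and `softmax_with_temperature` is a hypothetical helper, not a `transformers` function.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; higher temperature flattens the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens (not from the model).
toy_logits = [2.0, 1.0, 0.0]

for t in [0.5, 1.0, 2.5]:
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate probability on the top token, while high temperatures spread it out, which is consistent with the README's grids including "Invalid" cells as a possible outcome when sampling becomes more random.
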
@@ -87,12 +240,12 @@ C(CCCCCCCCO)=CCC=C
 ## Training Logs


+| Chunk | Chunk's Training Loss | Chunk's Validation Loss | Status |
+| :---: | :-------------------: | :---------------------: | :----: |
+| I | 1.346400 | 1.065180 | Done |
+| II | 1.123500 | 0.993118 | Done |
+| III | 1.058300 | 0.948303 | Done |
+| IV | 1.016600 | 0.921706 | Done |


 ## Evaluation Results
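
The validation losses in the Training Logs table above can be read as perplexities, under the assumption (not stated in the README) that each value is a mean cross-entropy in nats, in which case perplexity is `exp(loss)`. A small sketch of that arithmetic, using only the numbers from the table:

```python
import math

# Validation losses copied from the Training Logs table (chunks I to IV).
val_loss = {"I": 1.065180, "II": 0.993118, "III": 0.948303, "IV": 0.921706}

# Assumption: each loss is a mean cross-entropy in nats, so perplexity = exp(loss).
perplexity = {chunk: math.exp(loss) for chunk, loss in val_loss.items()}

# Report the chunk-to-chunk drop in loss alongside the resulting perplexity.
chunks = list(val_loss)
drops = []
for prev, cur in zip(chunks, chunks[1:]):
    drop = val_loss[prev] - val_loss[cur]
    drops.append(drop)
    print(f"{prev} -> {cur}: loss drop {drop:.6f}, perplexity {perplexity[cur]:.3f}")
```

If the logged loss uses a different base or reduction, the `exp` step would need to change accordingly.
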
@@ -102,10 +255,17 @@ C(CCCCCCCCO)=CCC=C
 - May generate unrealistic or synthetically inaccessible molecules
 - Performance on complex, branched, and ringed molecules to be evaluated

+## Disclaimer & Ethical Considerations
+
+- This model is in an early development stage and may not consistently generate valid outputs.
+- It is intended for personal exploration, academic, and research purposes only.
+- You should be aware of potential ethical concerns:
+  - Possible generation of harmful substances if misused
+  - Potential biases inherent in the training data
+- The accuracy, completeness, and reliability of the model's outputs are not guaranteed.
+- This model should not be used for any commercial or legal purposes.
+- The information and model provided are for educational and research use only.
+

 ## Additional Information
 - Part of experimental chemfie-gpt/T5 project