gbyuvd committed on
Commit
23ec36a
1 Parent(s): f1d4506

Update README.md

  1. README.md +173 -13
README.md CHANGED
@@ -7,7 +7,8 @@ tags:
7
  ---
8
  # chemfie-gpt-experiment-1
9
 
10
- On-going training (3/4)
 
11
 
12
  ## Model Details
13
  - **Model Type**: GPT-2
@@ -19,7 +20,8 @@ On-going training (3/4)
19
  - Hands-on learning, research and experimentation in molecular generation
20
  - Baseline for ablation studies and comparisons with more advanced models
21
 
22
- ## Use
 
23
  Since this model doesn't use a proper GPT2 format tokenizer, special tokens still need to be set up manually (next experiment will use a proper one ofc):
24
 
25
  ```python
@@ -53,9 +55,19 @@ print("Sample generated molecules:")
53
  for i, mol in enumerate(sample_molecules, 1):
54
      print(f"{i}. {mol}")
55
 
56
  ```
57
 
58
- Tokenized SELFIES to SMILES:
59
  ```python
60
  import selfies as sf
61
 
@@ -69,6 +81,147 @@ C(CCCCCCCCO)=CCC=C
69
  """"
70
  ```
71
 
72
 
73
  ## Training Data
74
  - **Source**: Curated and merged from the COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
@@ -87,12 +240,12 @@ C(CCCCCCCCO)=CCC=C
87
  ## Training Logs
88
 
89
 
90
- | Chunk | Chunk's Training Loss | Chunk's Validation Loss | Status |
91
- | :---: | :-------------------: | :---------------------: | :-----: |
92
- | I | 1.346400 | 1.065180 | Done |
93
- | II | 1.123500 | 0.993118 | Done |
94
- | III | 1.058300 | 0.948303 | Done |
95
- | IV | | | Ongoing |
96
 
97
 
98
  ## Evaluation Results
@@ -102,10 +255,17 @@ C(CCCCCCCCO)=CCC=C
102
  - May generate unrealistic or synthetically inaccessible molecules
103
  - Performance on complex, branched, and ringed molecules to be evaluated
104
 
105
- ## Ethical Considerations
106
- - Potential misuse for generating harmful or illegal substances
107
- - May produce biased results based on training data composition
108
- The information and model provided is for academic purposes only. It is intended for educational and research use, and should not be used for any commercial or legal purposes. The author do not guarantee the accuracy, completeness, or reliability of the information.
109
 
110
  ## Additional Information
111
  - Part of experimental chemfie-gpt/T5 project
 
7
  ---
8
  # chemfie-gpt-experiment-1
9
 
10
+ This model is part of my own hands-on learning and experimentation in molecule generation, to determine which type of model is best suited for SELFIES (GPT-2, T5, or a fill-mask model).
11
+ It also serves as a baseline for future ablation and customization studies in model architecture, dataset augmentation(s), and training processes.
12
 
13
  ## Model Details
14
  - **Model Type**: GPT-2
 
20
  - Hands-on learning, research and experimentation in molecular generation
21
  - Baseline for ablation studies and comparisons with more advanced models
22
 
23
+ ## Usage
24
+ ### Direct Use
25
  Since this model doesn't use a proper GPT2 format tokenizer, special tokens still need to be set up manually (next experiment will use a proper one ofc):
26
 
27
  ```python
 
55
  for i, mol in enumerate(sample_molecules, 1):
56
      print(f"{i}. {mol}")
57
 
58
+ """
59
+ ....
60
+ 2. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C] [C]
61
+ 3. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [=C] [Ring1] [N]
62
+ 4. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N]
63
+ 5. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C]
64
+
65
+ """
66
+
67
+
68
  ```
69
 
70
+ **Tokenized SELFIES to SMILES:**
71
  ```python
72
  import selfies as sf
73
 
 
81
  """"
82
  ```
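The model emits SELFIES symbols separated by spaces, while a decoder such as `sf.decoder` expects them joined together. The following is a minimal, standard-library-only sketch of that conversion in both directions. The helper names (`split_symbols`, `to_compact`, `to_spaced`) are hypothetical and are not part of the `selfies` package or of this repository; the sample string is the seed used elsewhere in this README.

```python
import re
from typing import List

# Matches one bracketed SELFIES symbol such as "[C]" or "[=Branch1]".
_SYMBOL = re.compile(r"\[[^\[\]]+\]")


def split_symbols(text: str) -> List[str]:
    """Return the bracketed symbols found in `text`, ignoring whitespace."""
    return _SYMBOL.findall(text)


def to_compact(spaced: str) -> str:
    """Join space-separated symbols into one compact SELFIES string."""
    return "".join(split_symbols(spaced))


def to_spaced(compact: str) -> str:
    """Put a single space between symbols, matching the model's output style."""
    return " ".join(split_symbols(compact))


sample = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1]"
compact = to_compact(sample)
print(compact)
print(to_spaced(compact) == sample)
```

The compact string is what would then be handed to a SELFIES decoder; the spaced form is useful when building a seed sequence for the tokenizer.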
83
 
84
+ #### Generate with Different Temperature(s) and Visualization
85
+
86
+ ```python
87
+ import torch
88
+ import selfies as sf
89
+ from rdkit import Chem
90
+ from rdkit.Chem import Draw
91
+ import matplotlib.pyplot as plt
92
+
93
+
94
+ def generate_molecules(temperature, num_molecules=2):
95
+     inputs = torch.tensor([[tokenizer.bos_token_id]])
96
+     gen = model.generate(
97
+         inputs,
98
+         do_sample=True,
99
+         max_length=256,
100
+         temperature=temperature,
101
+         early_stopping=True,
102
+         pad_token_id=tokenizer.pad_token_id,
103
+         num_beams=5,
104
+         num_return_sequences=num_molecules
105
+     )
106
+     return tokenizer.batch_decode(gen, skip_special_tokens=True)
107
+
108
+ def selfies_to_smiles(selfies_str):
109
+     selfies_str = selfies_str.replace(' ', '')
110
+     try:
111
+         return sf.decoder(selfies_str)
112
+     except Exception:
113
+         return None
114
+
115
+ def visualize_molecules(temperatures):
116
+     fig, axs = plt.subplots(len(temperatures), 2, figsize=(20, 4*len(temperatures)))  # change the column count (2) if you generate more than 2 samples per temperature
117
+     fig.suptitle("Generated Molecules at Different Temperatures", fontsize=16)
118
+
119
+     for i, temp in enumerate(temperatures):
120
+         molecules = generate_molecules(temp)
121
+         for j, mol in enumerate(molecules):
122
+             smiles = selfies_to_smiles(mol)
123
+             if smiles:
124
+                 rdkit_mol = Chem.MolFromSmiles(smiles)
125
+                 if rdkit_mol:
126
+                     img = Draw.MolToImage(rdkit_mol)
127
+                     axs[i, j].imshow(img)
128
+                     axs[i, j].axis('off')
129
+                     axs[i, j].set_title(f"Temp: {temp}", fontsize=10)
130
+                 else:
131
+                     axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center')
132
+                     axs[i, j].axis('off')
133
+             else:
134
+                 axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center')
135
+                 axs[i, j].axis('off')
136
+
137
+     plt.tight_layout()
138
+     plt.show()
139
+
140
+ # Generate and visualize molecules at different temperatures
141
+ temperatures = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
142
+ visualize_molecules(temperatures)
143
+
144
+ ```
145
+ **Example output:**
146
+
147
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/6Qxd4MgRD_isM9prx-XW3.png)
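To give some intuition for why the temperature changes the samples, here is a small standard-library sketch of temperature-scaled softmax: the logits are divided by the temperature before normalisation, so low values sharpen the distribution and high values flatten it. The function name and the three logits are made up for illustration; this is not the model's actual sampling code, which is handled inside `model.generate`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 2.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature most of the probability mass sits on the top-scoring token, which tends to give repetitive but safer sequences; at higher temperatures unlikely tokens are sampled more often, which is consistent with the more unusual structures seen in the later rows of the figure.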
148
+
149
+ #### Generate using Starting Sequence with Different Temperature(s) and Visualization
150
+
151
+ ```python
152
+ import torch
153
+ import selfies as sf
154
+ from rdkit import Chem
155
+ from rdkit.Chem import Draw
156
+ import matplotlib.pyplot as plt
157
+
158
+
159
+ def generate_molecules(seed, temperature, num_molecules=5):
160
+     # Tokenize the seed
161
+     seed_tokens = tokenizer.encode(seed, add_special_tokens=False, return_tensors="pt")
162
+
163
+     # Generate from the seed
164
+     gen = model.generate(
165
+         seed_tokens,
166
+         do_sample=True,
167
+         max_length=256,
168
+         temperature=temperature,
169
+         early_stopping=True,
170
+         pad_token_id=tokenizer.pad_token_id,
171
+         num_beams=5,
172
+         num_return_sequences=num_molecules
173
+     )
174
+
175
+     # Decode the generated sequences
176
+     generated = tokenizer.batch_decode(gen, skip_special_tokens=True)
177
+
178
+     # Combine seed with generated sequences
179
+     return [seed + seq[len(seed):] for seq in generated]
180
+
181
+ def selfies_to_smiles(selfies_str):
182
+     selfies_str = selfies_str.replace(' ', '')
183
+     try:
184
+         return sf.decoder(selfies_str)
185
+     except Exception:
186
+         return None
187
+
188
+ def visualize_molecules(seed, temperatures):
189
+     fig, axs = plt.subplots(len(temperatures), 5, figsize=(20, 4*len(temperatures)))
190
+     fig.suptitle(f"Generated Molecules at Different Temperatures\nSeed: {seed}", fontsize=16)
191
+
192
+     for i, temp in enumerate(temperatures):
193
+         molecules = generate_molecules(seed, temp)
194
+         for j, mol in enumerate(molecules):
195
+             smiles = selfies_to_smiles(mol)
196
+             if smiles:
197
+                 rdkit_mol = Chem.MolFromSmiles(smiles)
198
+                 if rdkit_mol:
199
+                     img = Draw.MolToImage(rdkit_mol)
200
+                     axs[i, j].imshow(img)
201
+                     axs[i, j].axis('off')
202
+                     axs[i, j].set_title(f"Temp: {temp}", fontsize=10)
203
+                 else:
204
+                     axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center')
205
+                     axs[i, j].axis('off')
206
+             else:
207
+                 axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center')
208
+                 axs[i, j].axis('off')
209
+
210
+     plt.tight_layout()
211
+     plt.show()
212
+
213
+ # Set the seed and temperatures
214
+ seed = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1]"
215
+ temperatures = [0.5, 1.0, 1.5, 2.0, 2.5]
216
+
217
+ # Generate and visualize molecules at different temperatures
218
+ visualize_molecules(seed, temperatures)
219
+
220
+ ```
221
+ **Example output:**
222
+
223
+
224
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/cHamzqHjBj4tNxDPgdZ-g.png)
225
 
226
  ## Training Data
227
  - **Source**: Curated and merged from the COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
 
240
  ## Training Logs
241
 
242
 
243
+ | Chunk | Chunk's Training Loss | Chunk's Validation Loss | Status |
244
+ | :---: | :-------------------: | :---------------------: | :----: |
245
+ | I | 1.346400 | 1.065180 | Done |
246
+ | II | 1.123500 | 0.993118 | Done |
247
+ | III | 1.058300 | 0.948303 | Done |
248
+ | IV | 1.016600 | 0.921706 | Done |
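As a reading aid for the table, the validation losses can be converted to perplexities with `exp(loss)`. This sketch assumes the reported losses are mean per-token cross-entropy values in nats, which the README does not state explicitly; if they use another base or reduction, the numbers below would not apply. The dictionary simply copies the validation column from the table.

```python
import math

# Validation loss per training chunk, copied from the table above.
validation_loss = {
    "I": 1.065180,
    "II": 0.993118,
    "III": 0.948303,
    "IV": 0.921706,
}

# Perplexity is exp(loss), assuming a mean per-token cross-entropy in nats.
perplexity = {chunk: math.exp(loss) for chunk, loss in validation_loss.items()}

for chunk, ppl in perplexity.items():
    print(f"Chunk {chunk}: validation perplexity ~ {ppl:.3f}")
```

Under that assumption, a falling perplexity across chunks means the model is, on average, choosing among fewer plausible next tokens as training progresses.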
249
 
250
 
251
  ## Evaluation Results
 
255
  - May generate unrealistic or synthetically inaccessible molecules
256
  - Performance on complex, branched, and ringed molecules to be evaluated
257
 
258
+ ## Disclaimer & Ethical Considerations
259
+
260
+ - This model is in an early development stage and may not consistently generate valid outputs.
261
+ - It is intended for personal exploration, academic, and research purposes only.
262
+ - You should be aware of potential ethical concerns:
263
+   - Possible generation of harmful substances if misused
264
+   - Potential biases inherent in the training data
265
+ - The accuracy, completeness, and reliability of the model's outputs are not guaranteed.
266
+ - This model should not be used for any commercial or legal purposes.
267
+ - The information and model provided are for educational and research use only.
268
+
269
 
270
  ## Additional Information
271
  - Part of experimental chemfie-gpt/T5 project