Update model

- README.md +129 -34
- model.safetensors +1 -1
- pytorch_model.bin +1 -1

README.md CHANGED
@@ -29,7 +29,7 @@ model-index:
29 |       type: fill-mask
30 |       name: Fill-Mask
31 |     dataset:
32 | -     name: main-eval-uniform
33 |       type: main-eval-uniform
34 |     metrics:
35 |     - type: perplexity
@@ -42,7 +42,7 @@ model-index:
42 |       type: fill-mask
43 |       name: Fill-Mask
44 |     dataset:
45 | -     name: main-eval-varied
46 |       type: main-eval-varied
47 |     metrics:
48 |     - type: perplexity
@@ -51,15 +51,41 @@ model-index:
51 |     - type: accuracy
52 |       value: 0.876
53 |       name: MLM Accuracy
54 | license: cc-by-nc-sa-4.0
55 | ---
56 | # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
57 |
58 | This is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique, valid molecules taken from COCONUTDB and ChemBL34, yielding 7.3M generated masked examples in total. Despite having only 11M parameters, it achieves decent performance:
59 | - On varied masking:
60 | -   - Perplexity of 1.
61 | - On uniform 15% masking:
62 | -   - Perplexity of 1.
63 |
64 | The masking strategy for pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.
65 |
@@ -137,6 +163,55 @@ mask_filler(text, top_k=5)
137 | """
138 | ```
139 |
140 | ## Background
141 | Three weeks ago, I had the idea to train a sentence transformer on a chemical "language", which, as far as I could tell at the time, did not yet exist. While trying to do so, I found a wonderful, human-readable molecular representation called SELFIES, developed by the [Aspuru-Guzik group](https://github.com/aspuru-guzik-group/selfies). I found this representation fascinating and worth exploring due to its robustness; so far it has proven versatile and easy to train models on. For more information on SELFIES, you can read this [blogpost](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
142 |
@@ -170,21 +245,21 @@ The dataset combines two sources of molecular data:
170 | - These validation sets were combined into a main test set, totaling 810,108 examples.
171 |
172 | | Dataset | Number of Valid Unique Molecules | Generated Training Examples |
173 | - | ---------- |
174 | - | Chunk I |
175 | - | Chunk II |
176 | - | Chunk III |
177 | - | Chunk IV |
178 | - | Chunk V |
179 | - | Chunk VI |
180 | - | Chunk VII |
181 | - | Chunk VIII |
182 | - | Chunk IX |
183 | - | Chunk X |
184 | - | Chunk XII |
185 | - | Chunk XI |
186 | - | Chunk XIII |
187 | - | Total |
188 |
189 | ### Training Procedure
190 |
@@ -274,8 +349,10 @@ This methodology aims to create a diverse and challenging dataset for masked language modeling
274 | #### Training Hyperparameters
275 |
276 | - Batch size = 128
277 | - - Num of Epoch
278 | - -
279 | - Training time on each chunk = 03h:24m / ~205 mins
280 |
281 | I am using the Ranger21 optimizer with these settings:
@@ -308,25 +385,44 @@ For more information about Ranger21, you could check out [this repository](https
308 | * Number of test examples: 810,108
309 |
310 | #### Varied Masking Test
311 |
312 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
313 | - | ------- |
314 | - | I-IV |
315 | - | V-VIII |
316 | - | IX-XIII |
317 |
318 | #### Uniform 15% Masking Test (80%:10%:10%)
319 |
320 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
321 | - | ----- |
322 | - | XII |
323 |
324 | ## Interpretability
325 |
326 | ##### Attention Head Visualization
327 | (coming soon)
328 |
329 | - #####
330 | (coming soon)
331 |
332 | ##### Attributions in Determining Masked Tokens
@@ -344,13 +440,12 @@ For more information about Ranger21, you could check out [this repository](https
344 |
345 | ### Compute Infrastructure
346 |
347 | -
348 | -
349 | - Platform: Paperspace's Gradients
350 |
351 | -
352 |
353 | -
354 |
355 | - Python: 3.9.13
356 | - Transformers: 4.42.4
@@ -423,6 +518,6 @@ G Bayu ([email protected])
423 |
424 | This project has been quite a journey for me. I've dedicated many hours to it, and I would like to improve myself, this model, and future projects. However, financial and computational constraints can be challenging.
425 |
426 | - If you find my work valuable and would like to support my journey, please consider supporting me [here](ko-fi.com/gbyuvd). Your support will help me cover costs for computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and exploring more.
427 |
428 | Thank you for checking out this model; I am more than happy to receive any feedback so that I can improve myself and the future models/projects I will be working on.

29 |       type: fill-mask
30 |       name: Fill-Mask
31 |     dataset:
32 | +     name: main-eval-uniform (Epoch 1)
33 |       type: main-eval-uniform
34 |     metrics:
35 |     - type: perplexity

42 |       type: fill-mask
43 |       name: Fill-Mask
44 |     dataset:
45 | +     name: main-eval-varied (Epoch 1)
46 |       type: main-eval-varied
47 |     metrics:
48 |     - type: perplexity

51 |     - type: accuracy
52 |       value: 0.876
53 |       name: MLM Accuracy
54 | + - task:
55 | +     type: fill-mask
56 | +     name: Fill-Mask
57 | +   dataset:
58 | +     name: main-eval-varied (Epoch 2)
59 | +     type: main-eval-varied
60 | +   metrics:
61 | +   - type: perplexity
62 | +     value: 1.4029
63 | +     name: Perplexity
64 | +   - type: accuracy
65 | +     value: 0.8883
66 | +     name: MLM Accuracy
67 | + - task:
68 | +     type: fill-mask
69 | +     name: Fill-Mask
70 | +   dataset:
71 | +     name: main-eval-uniform (Epoch 2)
72 | +     type: main-eval-uniform
73 | +   metrics:
74 | +   - type: perplexity
75 | +     value: 1.3276
76 | +     name: Perplexity
77 | +   - type: accuracy
78 | +     value: 0.9055
79 | +     name: MLM Accuracy
80 | license: cc-by-nc-sa-4.0
81 | ---
82 | # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
83 |
84 | This is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique, valid molecules taken from COCONUTDB and ChemBL34, yielding 7.3M generated masked examples in total. Despite having only 11M parameters, it achieves decent performance:
85 | - On varied masking:
86 | +   - Perplexity of 1.4029, MLM Accuracy of 88.83%
87 | - On uniform 15% masking:
88 | +   - Perplexity of 1.3276, MLM Accuracy of 90.55%
89 |
90 | The masking strategy for pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.
91 |
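The complexity-scaled masking described above can be sketched in a few lines. This is a hypothetical illustration, not the author's actual implementation: the card only says "simple scoring", so the `complexity_score` proxy and the linear mapping into the 15%-45% range are assumptions.

```python
import random

def complexity_score(tokens):
    # Hypothetical complexity proxy: fraction of distinct SELFIES tokens.
    # Stands in for the unspecified "simple scoring" from the model card.
    return len(set(tokens)) / len(tokens)

def masking_ratio(tokens, lo=0.15, hi=0.45):
    # Map the score linearly into the 15%-45% masking range used for pretraining.
    return lo + (hi - lo) * complexity_score(tokens)

def mask_tokens(tokens, mask_token="[MASK]", seed=0):
    # Mask a complexity-dependent fraction of positions, chosen at random.
    rng = random.Random(seed)
    n_mask = max(1, round(masking_ratio(tokens) * len(tokens)))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_idx else t for i, t in enumerate(tokens)]

tokens = "[C] [C] [O] [C] [=Branch1] [C] [=O] [C]".split()
print(mask_tokens(tokens))
```

A more complex molecule (more distinct tokens) thus gets a higher masking ratio, making its training examples harder.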

163 | """
164 | ```
165 |
166 | + In case you have SMILES instead, you can first convert them to SELFIES.
167 | + First, install the selfies library:
168 | +
169 | + ```bash
170 | + pip install selfies
171 | + ```
172 | + Then you can convert using:
173 | +
174 | + ```python
175 | + import selfies as sf
176 | +
177 | + def smiles_to_selfies_sentence(smiles):
178 | +     # Encode SMILES into SELFIES
179 | +     try:
180 | +         selfies = sf.encoder(smiles)
181 | +     except sf.EncoderError as e:
182 | +         print(f"Encoder Error: {e}")
183 | +         return None
184 | +
185 | +
186 | +     # Split SELFIES into individual tokens
187 | +     selfies_tokens = list(sf.split_selfies(selfies))
188 | +
189 | +     # Join dots with the nearest next tokens
190 | +     joined_tokens = []
191 | +     i = 0
192 | +     while i < len(selfies_tokens):
193 | +         if selfies_tokens[i] == '.' and i + 1 < len(selfies_tokens):
194 | +             joined_tokens.append(f".{selfies_tokens[i+1]}")
195 | +             i += 2
196 | +         else:
197 | +             joined_tokens.append(selfies_tokens[i])
198 | +             i += 1
199 | +
200 | +     # Join tokens with whitespace to form a sentence
201 | +     selfies_sentence = ' '.join(joined_tokens)
202 | +
203 | +     return selfies_sentence
204 | +
205 | + # Example usage:
206 | + in_smi = "CN(C)CCC(C1=CC=C(C=C1)Cl)C2=CC=CC=N2.C(=CC(=O)O)C(=O)O"  # Chlorphenamine maleate
207 | + selfies_sentence = smiles_to_selfies_sentence(in_smi)
208 | + print(selfies_sentence)
209 | +
210 | + """
211 | + [C] [N] [Branch1] [C] [C] [C] [C] [C] [Branch1] [N] [C] [=C] [C] [=C] [Branch1] [Branch1] [C] [=C] [Ring1] [=Branch1] [Cl] [C] [=C] [C] [=C] [C] [=N] [Ring1] [=Branch1] .[C] [=Branch1] [#Branch1] [=C] [C] [=Branch1] [C] [=O] [O] [C] [=Branch1] [C] [=O] [O]
212 | + """
213 | + ```
214 | +
215 | ## Background
216 | Three weeks ago, I had the idea to train a sentence transformer on a chemical "language", which, as far as I could tell at the time, did not yet exist. While trying to do so, I found a wonderful, human-readable molecular representation called SELFIES, developed by the [Aspuru-Guzik group](https://github.com/aspuru-guzik-group/selfies). I found this representation fascinating and worth exploring due to its robustness; so far it has proven versatile and easy to train models on. For more information on SELFIES, you can read this [blogpost](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
217 |
245 | - These validation sets were combined into a main test set, totaling 810,108 examples.
246 |
247 | | Dataset    | Number of Valid Unique Molecules | Generated Training Examples |
248 | + | ---------- | :------------------------------: | :-------------------------: |
249 | + | Chunk I    | 207,727 | 560,859 |
250 | + | Chunk II   | 207,727 | 560,859 |
251 | + | Chunk III  | 207,727 | 560,859 |
252 | + | Chunk IV   | 207,727 | 560,859 |
253 | + | Chunk V    | 207,727 | 560,859 |
254 | + | Chunk VI   | 207,727 | 560,859 |
255 | + | Chunk VII  | 207,727 | 560,859 |
256 | + | Chunk VIII | 207,727 | 560,859 |
257 | + | Chunk IX   | 207,727 | 560,859 |
258 | + | Chunk X    | 207,727 | 560,859 |
259 | + | Chunk XII  | 207,727 | 560,859 |
260 | + | Chunk XI   | 207,727 | 560,859 |
261 | + | Chunk XIII | 207,738 | 560,889 |
262 | + | Total      | 2,700,462 | 7,291,197 |
263 |
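The Total row above can be cross-checked from the per-chunk counts with a few lines of Python:

```python
# Chunks I-XII each contribute identical counts; chunk XIII differs slightly.
molecules = [207_727] * 12 + [207_738]
examples  = [560_859] * 12 + [560_889]

assert sum(molecules) == 2_700_462  # matches the Total column
assert sum(examples)  == 7_291_197  # matches the Total column

# On average, each molecule yields roughly 2.7 generated masked examples.
print(sum(examples) / sum(molecules))
```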
264 | ### Training Procedure
265 |
349 | #### Training Hyperparameters
350 |
351 | - Batch size = 128
352 | + - Num of Epochs:
353 | +   - 1 epoch over all chunks
354 | +   - another epoch on selected chunks (which still contained some samples from the chunks excluded due to overfitting tendencies)
355 | +   - Total steps on all chunks = 70,619
356 | - Training time on each chunk = 03h:24m / ~205 mins
357 |
358 | I am using the Ranger21 optimizer with these settings:
385 | * Number of test examples: 810,108
386 |
387 | #### Varied Masking Test
388 | + ##### 1st Epoch
389 |
390 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
391 | + | ------- | :------: | :--------: | :----------: |
392 | + | I-IV    | 0.4547   | 1.5758     | 0.851        |
393 | + | V-VIII  | 0.4224   | 1.5257     | 0.864        |
394 | + | IX-XIII | 0.3893   | 1.4759     | 0.876        |
395 | +
396 | + ##### 2nd Epoch
397 | +
398 | + | Chunk | Avg Loss | Perplexity | MLM Accuracy |
399 | + | ----- | :------: | :--------: | :----------: |
400 | + | I-II  | 0.3659   | 1.4418     | 0.8793       |
401 | + | VII   | 0.3386   | 1.4029     | 0.8883       |
402 |
403 | #### Uniform 15% Masking Test (80%:10%:10%)
404 |
405 | + ##### 1st Epoch
406 | +
407 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
408 | + | ----- | :------: | :--------: | :----------: |
409 | + | XII   | 0.3349   | 1.3978     | 0.8929       |
410 | +
411 | + ##### 2nd Epoch
412 | +
413 | + | Chunk | Avg Loss | Perplexity | MLM Accuracy |
414 | + | ----- | :------: | :--------: | :----------: |
415 | + |       | 0.2834   | 1.3276     | 0.9055       |
416 | +
417 |
## Interpretability
|
419 |
|
420 |
+
Using Acetylcholine as an example, with its protonated nitrogen masked (*[N+1]*) for visualization.
|
421 |
+
|
422 |
##### Attention Head Visualization
|
423 |
(coming soon)
|
424 |
|
425 |
+
##### Neuron Views
|
426 |
(coming soon)
|
427 |
|
428 |
##### Attributions in Determining Masked Tokens
|
|
|
440 |
441 | ### Compute Infrastructure
442 |
443 | + #### Hardware
444 |
445 | + - Platform: Paperspace's Gradients
446 | + - Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)
447 |
448 | + #### Software
449 |
450 | - Python: 3.9.13
451 | - Transformers: 4.42.4
518 |
519 | This project has been quite a journey for me. I've dedicated many hours to it, and I would like to improve myself, this model, and future projects. However, financial and computational constraints can be challenging.
520 |
521 | + If you find my work valuable and would like to support my journey, please consider supporting me [here](https://ko-fi.com/gbyuvd). Your support will help me cover costs for computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and exploring more.
522 |
523 | Thank you for checking out this model; I am more than happy to receive any feedback so that I can improve myself and the future models/projects I will be working on.
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1 | version https://git-lfs.github.com/spec/v1
2 | - oid sha256:
3 | size 44518452

1 | version https://git-lfs.github.com/spec/v1
2 | + oid sha256:eed21113483b940971724817c724cbbbbe38aa35dc336e0527b218ef412639ac
3 | size 44518452
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1 | version https://git-lfs.github.com/spec/v1
2 | - oid sha256:
3 | size 44557810

1 | version https://git-lfs.github.com/spec/v1
2 | + oid sha256:7ecddd24834ca750b4bbd01452bee86627548c56c9bdeb7632f7e46ab21be2a6
3 | size 44557810