Update model

- README.md +129 -34
- model.safetensors +1 -1
- pytorch_model.bin +1 -1

README.md CHANGED
@@ -29,7 +29,7 @@ model-index:
29 |       type: fill-mask
30 |       name: Fill-Mask
31 |     dataset:
32 | -     name: main-eval-uniform
33 |       type: main-eval-uniform
34 |     metrics:
35 |     - type: perplexity
@@ -42,7 +42,7 @@ model-index:
42 |       type: fill-mask
43 |       name: Fill-Mask
44 |     dataset:
45 | -     name: main-eval-varied
46 |       type: main-eval-varied
47 |     metrics:
48 |     - type: perplexity
@@ -51,15 +51,41 @@ model-index:
51 |     - type: accuracy
52 |       value: 0.876
53 |       name: MLM Accuracy
54 | license: cc-by-nc-sa-4.0
55 | ---
56 | # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
57 |
58 | This is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique, valid molecules taken from COCONUTDB and ChemBL34, yielding 7.3M generated masked examples in total. Despite having only 11M parameters, it achieves decent performance:
59 | - On varied masking:
60 | -   - Perplexity of 1.
61 | - On uniform 15% masking:
62 | -   - Perplexity of 1.
63 |
64 | The masking strategy for pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.
65 |
@@ -137,6 +163,55 @@ mask_filler(text, top_k=5)
137 | """
138 | ```
139 |
140 | ## Background
141 | Three weeks ago, I had the idea to train a sentence transformer on a chemical "language", which, as far as I could tell at the time, did not yet exist. While trying to do so, I found a wonderful, human-readable molecular representation called SELFIES, developed by the [Aspuru-Guzik group](https://github.com/aspuru-guzik-group/selfies). I found this representation fascinating and worth exploring due to its robustness; so far it has proven versatile and easy to train models on. For more information on SELFIES, you can read this [blogpost](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
142 |
@@ -170,21 +245,21 @@ The dataset combines two sources of molecular data:
170 | - These validation sets were combined into a main test set, totaling 810,108 examples.
171 |
172 | | Dataset | Number of Valid Unique Molecules | Generated Training Examples |
173 | - | ---------- |
174 | - | Chunk I |
175 | - | Chunk II |
176 | - | Chunk III |
177 | - | Chunk IV |
178 | - | Chunk V |
179 | - | Chunk VI |
180 | - | Chunk VII |
181 | - | Chunk VIII |
182 | - | Chunk IX |
183 | - | Chunk X |
184 | - | Chunk XII |
185 | - | Chunk XI |
186 | - | Chunk XIII |
187 | - | Total |
188 |
189 | ### Training Procedure
190 |
@@ -274,8 +349,10 @@ This methodology aims to create a diverse and challenging dataset for masked language modeling
274 | #### Training Hyperparameters
275 |
276 | - Batch size = 128
277 | - - Num of Epoch
278 | - -
279 | - Training time on each chunk = 03h:24m / ~205 mins
280 |
281 | I am using the Ranger21 optimizer with these settings:
@@ -308,25 +385,44 @@ For more information about Ranger21, you could check out [this repository](https
308 | * Number of test examples: 810,108
309 |
310 | #### Varied Masking Test
311 |
312 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
313 | - | ------- |
314 | - | I-IV |
315 | - | V-VIII |
316 | - | IX-XIII |
317 |
318 | #### Uniform 15% Masking Test (80%:10%:10%)
319 |
320 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
321 | - | ----- |
322 | - | XII |
323 |
324 | ## Interpretability
325 |
326 | ##### Attention Head Visualization
327 | (coming soon)
328 |
329 | - #####
330 | (coming soon)
331 |
332 | ##### Attributions in Determining Masked Tokens
@@ -344,13 +440,12 @@ For more information about Ranger21, you could check out [this repository](https
344 |
345 | ### Compute Infrastructure
346 |
347 | -
348 | -
349 | - Platform: Paperspace's Gradients
350 |
351 | -
352 |
353 | -
354 |
355 | - Python: 3.9.13
356 | - Transformers: 4.42.4
@@ -423,6 +518,6 @@ G Bayu ([email protected])
423 |
424 | This project has been quite a journey for me. I've dedicated many hours to it, and I would like to improve myself, this model, and future projects. However, financial and computational constraints can be challenging.
425 |
426 | - If you find my work valuable and would like to support my journey, please consider supporting me [here](ko-fi.com/gbyuvd). Your support will help me cover costs for computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and exploring more.
427 |
428 | Thank you for checking out this model; I am more than happy to receive any feedback so that I can improve myself and the future models/projects I will be working on.

29 |       type: fill-mask
30 |       name: Fill-Mask
31 |     dataset:
32 | +     name: main-eval-uniform (Epoch 1)
33 |       type: main-eval-uniform
34 |     metrics:
35 |     - type: perplexity

42 |       type: fill-mask
43 |       name: Fill-Mask
44 |     dataset:
45 | +     name: main-eval-varied (Epoch 1)
46 |       type: main-eval-varied
47 |     metrics:
48 |     - type: perplexity

51 |     - type: accuracy
52 |       value: 0.876
53 |       name: MLM Accuracy
54 | + - task:
55 | +     type: fill-mask
56 | +     name: Fill-Mask
57 | +   dataset:
58 | +     name: main-eval-varied (Epoch 2)
59 | +     type: main-eval-varied
60 | +   metrics:
61 | +   - type: perplexity
62 | +     value: 1.4029
63 | +     name: Perplexity
64 | +   - type: accuracy
65 | +     value: 0.8883
66 | +     name: MLM Accuracy
67 | + - task:
68 | +     type: fill-mask
69 | +     name: Fill-Mask
70 | +   dataset:
71 | +     name: main-eval-uniform (Epoch 2)
72 | +     type: main-eval-uniform
73 | +   metrics:
74 | +   - type: perplexity
75 | +     value: 1.3276
76 | +     name: Perplexity
77 | +   - type: accuracy
78 | +     value: 0.9055
79 | +     name: MLM Accuracy
80 | license: cc-by-nc-sa-4.0
81 | ---
82 | # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
83 |
84 | This is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique, valid molecules taken from COCONUTDB and ChemBL34, yielding 7.3M generated masked examples in total. Despite having only 11M parameters, it achieves decent performance:
85 | - On varied masking:
86 | +   - Perplexity of 1.4029, MLM Accuracy of 88.83%
87 | - On uniform 15% masking:
88 | +   - Perplexity of 1.3276, MLM Accuracy of 90.55%
89 |
90 | The masking strategy for pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.
91 |
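The complexity-scaled masking described above can be sketched in a few lines. This is a hypothetical illustration, not the author's actual implementation: the card only says "simple scoring", so the `complexity_score` proxy and the linear mapping into the 15%-45% range are assumptions.

```python
import random

def complexity_score(tokens):
    # Hypothetical complexity proxy: fraction of distinct SELFIES tokens.
    # Stands in for the unspecified "simple scoring" from the model card.
    return len(set(tokens)) / len(tokens)

def masking_ratio(tokens, lo=0.15, hi=0.45):
    # Map the score linearly into the 15%-45% masking range used for pretraining.
    return lo + (hi - lo) * complexity_score(tokens)

def mask_tokens(tokens, mask_token="[MASK]", seed=0):
    # Mask a complexity-dependent fraction of positions, chosen at random.
    rng = random.Random(seed)
    n_mask = max(1, round(masking_ratio(tokens) * len(tokens)))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_idx else t for i, t in enumerate(tokens)]

tokens = "[C] [C] [O] [C] [=Branch1] [C] [=O] [C]".split()
print(mask_tokens(tokens))
```

A more complex molecule (more distinct tokens) thus gets a higher masking ratio, making its training examples harder.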

163 | """
164 | ```
165 |
166 | + In case you have SMILES instead, you can first convert them to SELFIES.
167 | + First, install the selfies library:
168 | +
169 | + ```bash
170 | + pip install selfies
171 | + ```
172 | + Then you can convert using:
173 | +
174 | + ```python
175 | + import selfies as sf
176 | +
177 | + def smiles_to_selfies_sentence(smiles):
178 | +     # Encode SMILES into SELFIES
179 | +     try:
180 | +         selfies = sf.encoder(smiles)
181 | +     except sf.EncoderError as e:
182 | +         print(f"Encoder Error: {e}")
183 | +         return None
184 | +
185 | +
186 | +     # Split SELFIES into individual tokens
187 | +     selfies_tokens = list(sf.split_selfies(selfies))
188 | +
189 | +     # Join dots with the nearest next tokens
190 | +     joined_tokens = []
191 | +     i = 0
192 | +     while i < len(selfies_tokens):
193 | +         if selfies_tokens[i] == '.' and i + 1 < len(selfies_tokens):
194 | +             joined_tokens.append(f".{selfies_tokens[i+1]}")
195 | +             i += 2
196 | +         else:
197 | +             joined_tokens.append(selfies_tokens[i])
198 | +             i += 1
199 | +
200 | +     # Join tokens with whitespace to form a sentence
201 | +     selfies_sentence = ' '.join(joined_tokens)
202 | +
203 | +     return selfies_sentence
204 | +
205 | + # Example usage:
206 | + in_smi = "CN(C)CCC(C1=CC=C(C=C1)Cl)C2=CC=CC=N2.C(=CC(=O)O)C(=O)O"  # Chlorphenamine maleate
207 | + selfies_sentence = smiles_to_selfies_sentence(in_smi)
208 | + print(selfies_sentence)
209 | +
210 | + """
211 | + [C] [N] [Branch1] [C] [C] [C] [C] [C] [Branch1] [N] [C] [=C] [C] [=C] [Branch1] [Branch1] [C] [=C] [Ring1] [=Branch1] [Cl] [C] [=C] [C] [=C] [C] [=N] [Ring1] [=Branch1] .[C] [=Branch1] [#Branch1] [=C] [C] [=Branch1] [C] [=O] [O] [C] [=Branch1] [C] [=O] [O]
212 | + """
213 | + ```
214 | +
215 | ## Background
216 | Three weeks ago, I had the idea to train a sentence transformer on a chemical "language", which, as far as I could tell at the time, did not yet exist. While trying to do so, I found a wonderful, human-readable molecular representation called SELFIES, developed by the [Aspuru-Guzik group](https://github.com/aspuru-guzik-group/selfies). I found this representation fascinating and worth exploring due to its robustness; so far it has proven versatile and easy to train models on. For more information on SELFIES, you can read this [blogpost](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
217 |
245 | - These validation sets were combined into a main test set, totaling 810,108 examples.
246 |
247 | | Dataset    | Number of Valid Unique Molecules | Generated Training Examples |
248 | + | ---------- | :------------------------------: | :-------------------------: |
249 | + | Chunk I    | 207,727 | 560,859 |
250 | + | Chunk II   | 207,727 | 560,859 |
251 | + | Chunk III  | 207,727 | 560,859 |
252 | + | Chunk IV   | 207,727 | 560,859 |
253 | + | Chunk V    | 207,727 | 560,859 |
254 | + | Chunk VI   | 207,727 | 560,859 |
255 | + | Chunk VII  | 207,727 | 560,859 |
256 | + | Chunk VIII | 207,727 | 560,859 |
257 | + | Chunk IX   | 207,727 | 560,859 |
258 | + | Chunk X    | 207,727 | 560,859 |
259 | + | Chunk XII  | 207,727 | 560,859 |
260 | + | Chunk XI   | 207,727 | 560,859 |
261 | + | Chunk XIII | 207,738 | 560,889 |
262 | + | Total      | 2,700,462 | 7,291,197 |
263 |
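The Total row above can be cross-checked from the per-chunk counts with a few lines of Python:

```python
# Chunks I-XII each contribute identical counts; chunk XIII differs slightly.
molecules = [207_727] * 12 + [207_738]
examples  = [560_859] * 12 + [560_889]

assert sum(molecules) == 2_700_462  # matches the Total column
assert sum(examples)  == 7_291_197  # matches the Total column

# On average, each molecule yields roughly 2.7 generated masked examples.
print(sum(examples) / sum(molecules))
```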
264 | ### Training Procedure
265 |
349 | #### Training Hyperparameters
350 |
351 | - Batch size = 128
352 | + - Num of Epochs:
353 | +   - 1 epoch over all chunks
354 | +   - another epoch on selected chunks (which still contained some samples from the chunks excluded due to overfitting tendencies)
355 | +   - Total steps on all chunks = 70,619
356 | - Training time on each chunk = 03h:24m / ~205 mins
357 |
358 | I am using the Ranger21 optimizer with these settings:
385 | * Number of test examples: 810,108
386 |
387 | #### Varied Masking Test
388 | + ##### 1st Epoch
389 |
390 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
391 | + | ------- | :------: | :--------: | :----------: |
392 | + | I-IV    | 0.4547   | 1.5758     | 0.851        |
393 | + | V-VIII  | 0.4224   | 1.5257     | 0.864        |
394 | + | IX-XIII | 0.3893   | 1.4759     | 0.876        |
395 | +
396 | + ##### 2nd Epoch
397 | +
398 | + | Chunk | Avg Loss | Perplexity | MLM Accuracy |
399 | + | ----- | :------: | :--------: | :----------: |
400 | + | I-II  | 0.3659   | 1.4418     | 0.8793       |
401 | + | VII   | 0.3386   | 1.4029     | 0.8883       |
402 |
403 | #### Uniform 15% Masking Test (80%:10%:10%)
404 |
405 | + ##### 1st Epoch
406 | +
407 | | Chunk | Avg Loss | Perplexity | MLM Accuracy |
408 | + | ----- | :------: | :--------: | :----------: |
409 | + | XII   | 0.3349   | 1.3978     | 0.8929       |
410 | +
411 | + ##### 2nd Epoch
412 | +
413 | + | Chunk | Avg Loss | Perplexity | MLM Accuracy |
414 | + | ----- | :------: | :--------: | :----------: |
415 | + |       | 0.2834   | 1.3276     | 0.9055       |
416 | +
417 |
## Interpretability
|
419 |
|
420 |
+
Using Acetylcholine as an example, with its protonated nitrogen masked (*[N+1]*) for visualization.
|
421 |
+
|
422 |
##### Attention Head Visualization
|
423 |
(coming soon)
|
424 |
|
425 |
+
##### Neuron Views
|
426 |
(coming soon)
|
427 |
|
428 |
##### Attributions in Determining Masked Tokens
|
|
|
440 |
441 | ### Compute Infrastructure
442 |
443 | + #### Hardware
444 |
445 | + - Platform: Paperspace's Gradients
446 | + - Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)
447 |
448 | + #### Software
449 |
450 | - Python: 3.9.13
451 | - Transformers: 4.42.4
518 |
519 | This project has been quite a journey for me. I've dedicated many hours to it, and I would like to improve myself, this model, and future projects. However, financial and computational constraints can be challenging.
520 |
521 | + If you find my work valuable and would like to support my journey, please consider supporting me [here](https://ko-fi.com/gbyuvd). Your support will help me cover costs for computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and exploring more.
522 |
523 | Thank you for checking out this model; I am more than happy to receive any feedback so that I can improve myself and the future models/projects I will be working on.
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1 | version https://git-lfs.github.com/spec/v1
2 | - oid sha256:
3 | size 44518452

1 | version https://git-lfs.github.com/spec/v1
2 | + oid sha256:eed21113483b940971724817c724cbbbbe38aa35dc336e0527b218ef412639ac
3 | size 44518452
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1 | version https://git-lfs.github.com/spec/v1
2 | - oid sha256:
3 | size 44557810

1 | version https://git-lfs.github.com/spec/v1
2 | + oid sha256:7ecddd24834ca750b4bbd01452bee86627548c56c9bdeb7632f7e46ab21be2a6
3 | size 44557810