|
--- |
|
license: cc-by-nc-sa-4.0 |
|
library_name: transformers |
|
tags: |
|
- chemistry |
|
- selfies |
|
--- |
|
# chemfie-gpt-experiment-1 |
|
|
|
This model is part of my own hands-on learning and experimentation on molecule generation, to determine which type of model is best suited for SELFIES (GPT2, T5, or by way of fill-mask). |
|
It also serves as a baseline for future ablation and customization studies in model architecture, dataset augmentation(s), and training processes. |
|
|
|
## Model Details |
|
- **Model Type**: GPT-2 |
|
- **Architecture**: L8, A6, H384 |
|
- **Task**: Generation of SELFIES strings |
|
- **Language**: N/A (Chemical representation) |
|
|
|
## Personal Intended Use |
|
- Hands-on learning, research and experimentation in molecular generation |
|
- Baseline for ablation studies and comparisons with more advanced models |
|
|
|
## Usage |
|
### Direct Use |
|
Since this model doesn't use a proper GPT2 format tokenizer, special tokens still need to be set up manually (next experiment will use a proper one ofc): |
|
|
|
```python |
|
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM |
|
import torch |
|
|
|
tokenizer = PreTrainedTokenizerFast( |
|
tokenizer_file="gpt2_tokenizer.json", |
|
model_max_length=512, |
|
unk_token="<unk>", |
|
pad_token="<pad>", |
|
eos_token="</s>", |
|
bos_token="<s>", |
|
mask_token="<mask>", |
|
) |
|
|
|
model = AutoModelForCausalLM.from_pretrained("gbyuvd/chemfie-gpt-experiment-1") |
|
|
|
# Generate some sample outputs |
|
def generate_molecules(model, tokenizer, num_samples=5, max_length=100): |
|
model.eval() |
|
generated = [] |
|
for _ in range(num_samples): |
|
input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(model.device) |
|
output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, do_sample=True) |
|
generated.append(tokenizer.decode(output[0], skip_special_tokens=True)) |
|
return generated |
|
|
|
sample_molecules = generate_molecules(model, tokenizer) |
|
print("Sample generated molecules:") |
|
for i, mol in enumerate(sample_molecules, 1): |
|
print(f"{i}. {mol}") |
|
|
|
"""" |
|
.... |
|
2. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C] [C] |
|
3. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [=C] [Ring1] [N] |
|
4. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] |
|
5. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C] |
|
|
|
"""" |
|
|
|
|
|
``` |
|
|
|
**Tokenized SELFIES to SMILES:** |
|
```python |
|
import selfies as sf |
|
|
|
test = "[C] [Branch1] [O] [=C] [C] [C] [C] [C] [C] [C] [C] [=Branch1] [=O] [O] [=C] [C] [C] [C] [Ring1]" |
|
test = test.replace(' ', '') |
|
print(sf.decoder(test)) |
|
|
|
"""" |
|
C(CCCCCCCCO)=CCC=C |
|
|
|
"""" |
|
``` |
|
|
|
#### Generate with Different Temperature(s) and Visualization |
|
|
|
```python |
|
import torch |
|
import selfies as sf |
|
from rdkit import Chem |
|
from rdkit.Chem import Draw |
|
import matplotlib.pyplot as plt |
|
|
|
|
|
def generate_molecules(temperature, num_molecules=2): |
|
inputs = torch.tensor([[tokenizer.bos_token_id]]) |
|
gen = model.generate( |
|
inputs, |
|
do_sample=True, |
|
max_length=256, |
|
temperature=temperature, |
|
early_stopping=True, |
|
pad_token_id=tokenizer.pad_token_id, |
|
num_beams=5, |
|
num_return_sequences=num_molecules |
|
) |
|
return tokenizer.batch_decode(gen, skip_special_tokens=True) |
|
|
|
def selfies_to_smiles(selfies_str): |
|
selfies_str = selfies_str.replace(' ', '') |
|
try: |
|
return sf.decoder(selfies_str) |
|
except: |
|
return None |
|
|
|
def visualize_molecules(temperatures): |
|
fig, axs = plt.subplots(len(temperatures), 2, figsize=(20, 4*len(temperatures))) # don't forget to change this args, if you want to generate more than 2 samples each |
|
fig.suptitle("Generated Molecules at Different Temperatures", fontsize=16) |
|
|
|
for i, temp in enumerate(temperatures): |
|
molecules = generate_molecules(temp) |
|
for j, mol in enumerate(molecules): |
|
smiles = selfies_to_smiles(mol) |
|
if smiles: |
|
rdkit_mol = Chem.MolFromSmiles(smiles) |
|
if rdkit_mol: |
|
img = Draw.MolToImage(rdkit_mol) |
|
axs[i, j].imshow(img) |
|
axs[i, j].axis('off') |
|
axs[i, j].set_title(f"Temp: {temp}", fontsize=10) |
|
else: |
|
axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center') |
|
axs[i, j].axis('off') |
|
else: |
|
axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center') |
|
axs[i, j].axis('off') |
|
|
|
plt.tight_layout() |
|
plt.show() |
|
|
|
# Generate and visualize molecules at different temperatures |
|
temperatures = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5] |
|
visualize_molecules(temperatures) |
|
|
|
``` |
|
**Output example:** |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/6Qxd4MgRD_isM9prx-XW3.png) |
|
|
|
#### Generate using Starting Sequence with Different Temperature(s) and Visualization |
|
|
|
```python |
|
import torch |
|
import selfies as sf |
|
from rdkit import Chem |
|
from rdkit.Chem import Draw |
|
import matplotlib.pyplot as plt |
|
|
|
|
|
def generate_molecules(seed, temperature, num_molecules=5): |
|
# Tokenize the seed |
|
seed_tokens = tokenizer.encode(seed, add_special_tokens=False, return_tensors="pt") |
|
|
|
# Generate from the seed |
|
gen = model.generate( |
|
seed_tokens, |
|
do_sample=True, |
|
max_length=256, |
|
temperature=temperature, |
|
early_stopping=True, |
|
pad_token_id=tokenizer.pad_token_id, |
|
num_beams=5, |
|
num_return_sequences=num_molecules |
|
) |
|
|
|
# Decode the generated sequences |
|
generated = tokenizer.batch_decode(gen, skip_special_tokens=True) |
|
|
|
# Combine seed with generated sequences |
|
return [seed + seq[len(seed):] for seq in generated] |
|
|
|
def selfies_to_smiles(selfies_str): |
|
selfies_str = selfies_str.replace(' ', '') |
|
try: |
|
return sf.decoder(selfies_str) |
|
except: |
|
return None |
|
|
|
def visualize_molecules(seed, temperatures): |
|
fig, axs = plt.subplots(len(temperatures), 5, figsize=(20, 4*len(temperatures))) |
|
fig.suptitle(f"Generated Molecules at Different Temperatures\nSeed: {seed}", fontsize=16) |
|
|
|
for i, temp in enumerate(temperatures): |
|
molecules = generate_molecules(seed, temp) |
|
for j, mol in enumerate(molecules): |
|
smiles = selfies_to_smiles(mol) |
|
if smiles: |
|
rdkit_mol = Chem.MolFromSmiles(smiles) |
|
if rdkit_mol: |
|
img = Draw.MolToImage(rdkit_mol) |
|
axs[i, j].imshow(img) |
|
axs[i, j].axis('off') |
|
axs[i, j].set_title(f"Temp: {temp}", fontsize=10) |
|
else: |
|
axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center') |
|
axs[i, j].axis('off') |
|
else: |
|
axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center') |
|
axs[i, j].axis('off') |
|
|
|
plt.tight_layout() |
|
plt.show() |
|
|
|
# Set the seed and temperatures |
|
seed = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1]" |
|
temperatures = [0.5, 1.0, 1.5, 2.0, 2.5] |
|
|
|
# Generate and visualize molecules at different temperatures |
|
visualize_molecules(seed, temperatures) |
|
|
|
``` |
|
**Example output:** |
|
|
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/cHamzqHjBj4tNxDPgdZ-g.png) |
|
|
|
## Training Data |
|
- **Source**: Curated and merged from COCONUTDB (Sorokina et al., 2021), ChemBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al. 2023) database |
|
- **Total**: 2,933,355 samples |
|
- **Total Train**: 2,346,680 samples |
|
- **Validation**: 293,336 samples |
|
- **Per chunk**: 586,670 train, 73,334 validation, 73,334 test |
|
- **Random seed for split**: 42 |
|
|
|
## Training Procedure |
|
- **Batch Size**: 64 |
|
- **Num Epoch for Each Chunk**: 1 |
|
- **Learning Rate**: 1.5e-5 |
|
- **Optimizer**: Ranger21 (MADGRAD-Lookahead-AdaBelief with gradient centralization, linear warm up (22%), gradient clipping, and L2 weight decay) |
|
|
|
## Training Logs |
|
|
|
|
|
| Chunk | Chunk's Training Loss | Chunk's Validation Loss | Status | |
|
| :---: | :-------------------: | :---------------------: | :----: | |
|
| I | 1.346400 | 1.065180 | Done | |
|
| II | 1.123500 | 0.993118 | Done | |
|
| III | 1.058300 | 0.948303 | Done | |
|
| IV | 1.016600 | 0.921706 | Done | |
|
|
|
|
|
## Evaluation Results |
|
[To be filled after model evaluation] |
|
|
|
## Limitations and Biases |
|
- May generate unrealistic or synthetically inaccessible molecules |
|
- Performance on complex, branched, and ringed molecules to be evaluated |
|
|
|
## Disclaimer & Ethical Considerations |
|
|
|
- This model is in early development stage and may not consistently generate valid outputs. |
|
- It is intended for personal exploration, academic, and research purposes only. |
|
- You should be aware of potential ethical concerns: |
|
- Possible generation of harmful substances if misused |
|
- Potential biases inherent in the training data |
|
- The accuracy, completeness, and reliability of the model's outputs are not guaranteed. |
|
- This model should not be used for any commercial or legal purposes. |
|
- The information and model provided are for educational and research use only. |
|
|
|
|
|
## Additional Information |
|
- Part of experimental chemfie-gpt/T5 project |
|
- Serves as a baseline for future experiments with further curated datasets, training improvements, and architectural modifications |
|
|
|
## Citation |
|
### BibTeX |
|
#### COCONUTDB |
|
```bibtex |
|
@article{sorokina2021coconut, |
|
title={COCONUT online: Collection of Open Natural Products database}, |
|
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph}, |
|
journal={Journal of Cheminformatics}, |
|
volume={13}, |
|
number={1}, |
|
pages={2}, |
|
year={2021}, |
|
doi={10.1186/s13321-020-00478-9} |
|
} |
|
``` |
|
|
|
#### ChemBL34 |
|
```bibtex |
|
@article{zdrazil2023chembl, |
|
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods}, |
|
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R}, |
|
journal={Nucleic Acids Research}, |
|
year={2023}, |
|
volume={gkad1004}, |
|
doi={10.1093/nar/gkad1004} |
|
} |
|
|
|
@misc{chembl34, |
|
title={ChemBL34}, |
|
year={2023}, |
|
doi={10.6019/CHEMBL.database.34} |
|
} |
|
``` |
|
|
|
#### SuperNatural3 |
|
```bibtex |
|
@article{Gallo2023, |
|
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P}, |
|
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}}, |
|
journal = {Nucleic Acids Research}, |
|
year = {2023}, |
|
month = jan, |
|
day = {6}, |
|
volume = {51}, |
|
number = {D1}, |
|
pages = {D654-D659}, |
|
doi = {10.1093/nar/gkac1008} |
|
} |
|
``` |
|
|