Reconstruction failed for new SMILES
Great work! I tried applying the same decoder to a new set of SMILES strings (see below), but the reconstructed SMILES are completely different. Any idea how to fine-tune the model for this set of SMILES?
smiles=['CC(C)CCCC(C)CCCC(C)CCOC1OCC@@HC@@H[C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCC@@HC@H[C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCC@@HC@H[C@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCC@HC@H[C@H]1O']
This is an interesting case. First, I cleaned up the syntax and canonicalized the strings:
raw_smiles = ['CC(C)CCCC(C)CCCC(C)CCOC1OCC@@HC@@H[C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCC@@HC@H[C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCC@@HC@H[C@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCC@HC@H[C@H]1O']
syntax_corrected_smiles = ['CC(C)CCCC(C)CCCC(C)CCOC1OC[C@@H][C@@H][C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OC[C@@H][C@H][C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OC[C@@H][C@H][C@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OC[C@H][C@H][C@H]1O']
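The bare `C@@H`/`C@H` tokens are invalid SMILES; chiral atoms must be bracketed. A regex along these lines can perform that correction (my own sketch, not the original cleanup script):

```python
import re

def bracket_chiral_carbons(smiles: str) -> str:
    """Wrap bare C@H / C@@H tokens in brackets so they parse as
    chiral atoms. Tokens that are already bracketed are left alone."""
    return re.sub(r'(?<!\[)C(@@?H)(?!\])', r'[C\1]', smiles)

raw = 'CC(C)CCCC(C)CCCC(C)CCOC1OCC@@HC@H[C@@H]1O'
print(bracket_chiral_carbons(raw))
# CC(C)CCCC(C)CCCC(C)CCOC1OC[C@@H][C@H][C@@H]1O
```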
canonical_corrected_smiles = ['CC(C)CCCC(C)CCCC(C)CCOC1OC[CH][CH][C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OC[CH][CH][C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OC[CH][CH][C@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OC[CH][CH][C@H]1O']
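Canonicalization can be done with RDKit (a sketch, assuming RDKit is installed). Note that the two middle ring carbons have only three explicit connections, so they are not valid stereocenters; the canonicalizer drops their chiral tags, which is why they come out as `[CH]`:

```python
from rdkit import Chem

syntax_corrected = 'CC(C)CCCC(C)CCCC(C)CCOC1OC[C@@H][C@H][C@@H]1O'
# CanonSmiles parses the string and returns RDKit's canonical form
canonical = Chem.CanonSmiles(syntax_corrected)
print(canonical)
```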
# `[CH]` parses as a carbon with one hydrogen and a free radical
radical_removed_smiles = ['CC(C)CCCC(C)CCCC(C)CCOC1OCCC[C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCCC[C@@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCCC[C@H]1O',
'CC(C)CCCC(C)CCCC(C)CCOC1OCCC[C@H]1O']
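The radical removal here is a plain string substitution: rewriting `[CH]` as `C` lets the parser fill in the implicit hydrogens. A minimal sketch (again my own, not the original script):

```python
def remove_ring_radicals(smiles: str) -> str:
    """Replace explicit radical carbons [CH] with plain C so the
    implicit-hydrogen count is restored by the parser."""
    return smiles.replace('[CH]', 'C')

s = 'CC(C)CCCC(C)CCCC(C)CCOC1OC[CH][CH][C@@H]1O'
print(remove_ring_radicals(s))
# CC(C)CCCC(C)CCCC(C)CCOC1OCCC[C@@H]1O
```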
The cleaned SMILES produced reconstructions that are more reasonable, but still very wrong:
reconstructed_smiles = ['CC(C)N(C)CCCN1CCCC[C@@H]1C',
'CC(C)N(C)CCCN1CCCC[C@@H]1C',
'CC(C)N(C)CCCN1CCCC[C@H]1C',
'CC(C)N(C)CCCN1CCCC[C@H]1C']
The reconstructed SMILES show low cosine similarity ([0.6736, 0.6736, 0.6770, 0.6770]), suggesting the problem is with the decoder model rather than the encoder model. These values are notable because, on the test set, even items that failed reconstruction averaged a cosine similarity of 0.96.
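For reference, the similarity here is the standard cosine of the angle between the two embedding vectors (input vs. reconstruction). A plain-Python sketch; the vectors below are illustrative, not the actual embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# illustrative vectors only
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```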
The interesting question is where the issue comes from. The decoder was only trained on 30m molecules (compared to 480m for the encoder), so maybe the answer is simply "train more". Alternatively, I searched your compounds against ZINC (the training data is from ZINC) and didn't find similar structures. It's possible the issue is due to data bias in ZINC: your compounds fall into the DK tranche (molwt 325-350, clogP >5), which contains only ~100k compounds (very small for ZINC), most of them with a small number of rotatable bonds.
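To check where a compound falls in ZINC's property space, the relevant descriptors can be computed with RDKit (a sketch, assuming RDKit is installed; the tranche boundaries quoted above come from ZINC, not from this code):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

smi = 'CC(C)CCCC(C)CCCC(C)CCOC1OCCC[C@@H]1O'
mol = Chem.MolFromSmiles(smi)
print('MolWt:', Descriptors.MolWt(mol))                    # molecular weight
print('MolLogP:', Descriptors.MolLogP(mol))                # Crippen clogP
print('RotatableBonds:', Descriptors.NumRotatableBonds(mol))
```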
I uploaded two scripts, one for data prep and one for training. Try training on your own dataset and see if that improves performance.
Thank you for the prompt response. I had suspected that the error might originate from the decoder model. As suggested, I will retrain the model and assess the performance.