jannisborn committed
Commit: 45d9693
Parent(s): 6e90dd6

update

Files changed:
- app.py (+3, -0)
- model_cards/regression_transformer_article.md (+3, -2)
app.py
CHANGED
@@ -7,6 +7,7 @@ from gt4sd.algorithms.conditional_generation.regression_transformer import (
     RegressionTransformer,
 )
 from gt4sd.algorithms.registry import ApplicationsRegistry
+from terminator.tokenization import PolymerGraphTokenizer
 from utils import (
     draw_grid_generate,
     draw_grid_predict,
@@ -95,6 +96,8 @@ def regression_transformer(
         ]
     )
     samples = correct_samples
+    # if isinstance(config.generator.tokenizer.text_tokenizer, PolymerGraphTokenizer):
+    #     pass
     if task == "Predict":
         return draw_grid_predict(samples[0], target, domain=algorithm.split(":")[0])
     else:
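The commented-out lines added to `app.py` hint at type-based dispatch on the tokenizer, so that block-copolymer outputs could be post-processed differently from SELFIES/SMILES ones. A minimal standalone sketch of that pattern follows; the stub classes and the `postprocess` helper are hypothetical stand-ins, not the actual `terminator`/`gt4sd` API, and the cleanup rule is invented for illustration:

```python
class BaseTokenizer:
    """Stand-in for a generic text tokenizer."""


class PolymerGraphTokenizer(BaseTokenizer):
    """Stand-in for terminator.tokenization.PolymerGraphTokenizer."""


def postprocess(samples, tokenizer):
    """Dispatch on the tokenizer type before rendering the result grid."""
    if isinstance(tokenizer, PolymerGraphTokenizer):
        # Hypothetical copolymer-specific cleanup of block separators.
        return [s.replace("|", ".") for s in samples]
    # Default branch: SELFIES/SMILES samples pass through unchanged.
    return samples


print(postprocess(["A|B"], PolymerGraphTokenizer()))  # copolymer branch
print(postprocess(["CCO"], BaseTokenizer()))          # default branch
```

Since `isinstance` also matches subclasses, a `PolymerGraphTokenizer` would still take the default branch if the check were written against a sibling class; checking the most specific type first is the usual way to layer such dispatch.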
model_cards/regression_transformer_article.md
CHANGED
@@ -60,12 +60,13 @@ Optionally specifies a list of substructures that should definitely be present i
 **Algorithm version**: Models trained and distributed by the original authors.
 - **Molecules: QED**: Model trained on 1.6M molecules (SELFIES) from ChEMBL and their QED scores.
 - **Molecules: Solubility**: QED model finetuned on the ESOL dataset from [Delaney et al (2004), *J. Chem. Inf. Comput. Sci.*](https://pubs.acs.org/doi/10.1021/ci034243x) to predict water solubility. Model trained on augmented SELFIES.
-- **Molecules: USPTO**: Model trained on 2.8M [chemical reactions](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) from the US patent office. The model used SELFIES and a synthetic property (total molecular weight of all precursors).
-- **Molecules: Polymer**: Model finetuned on 600 ROPs (ring-opening polymerizations) with monomer-catalyst pairs. Model used three properties: conversion (`<conv>`), PDI (`<pdi>`) and Molecular Weight (`<molwt>`). Model trained with augmented SELFIES, optimized only to generate catalysts, given a monomer and the property constraints. See the example for details.
 - **Molecules: Cosmo_acdl**: Model finetuned on 56k molecules with two properties (*pKa_ACDL* and *pKa_COSMO*). Model used augmented SELFIES.
 - **Molecules: Pfas**: Model finetuned on ~1k PFAS (Perfluoroalkyl and Polyfluoroalkyl Substances) molecules with 9 properties including some experimentally measured ones (biodegradability, LD50 etc) and some synthetic ones (SCScore, molecular weight). Model trained on augmented SELFIES.
 - **Molecules: Logp_and_synthesizability**: Model trained on 2.9M molecules (SELFIES) from PubChem with **two** synthetic properties, the logP (partition coefficient) and the [SCScore by Coley et al. (2018); *J. Chem. Inf. Model.*](https://pubs.acs.org/doi/full/10.1021/acs.jcim.7b00622?casa_token=JZzOrdWlQ_QAAAAA%3A3_ynCfBJRJN7wmP2gyAR0EWXY-pNW_l-SGwSSU2SGfl5v5SxcvqhoaPNDhxq4THberPoyyYqTZELD4Ck)
 - **Molecules: Crippen_logp**: Model trained on 2.9M molecules (SMILES) from PubChem, but *only* on logP (partition coefficient).
+- **Molecules: Reactions: USPTO**: Model trained on 2.8M [chemical reactions](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) from the US patent office. The model used SELFIES and a synthetic property (total molecular weight of all precursors).
+- **Molecules: Polymers: ROP Catalyst**: Model finetuned on 600 ROPs (ring-opening polymerizations) with monomer-catalyst pairs. Model used three properties: conversion (`<conv>`), PDI (`<pdi>`) and Molecular Weight (`<molwt>`). Model trained with augmented SELFIES, optimized only to generate catalysts, given a monomer and the property constraints. Try the above UI example and see [Park et al. (2022, ChemRxiv)](https://chemrxiv.org/engage/chemrxiv/article-details/62b60865e84dd185e60214af) for details.
+- **Molecules: Polymers: Block copolymer**: Model finetuned on ~1k block copolymers with a novel string representation developed for polymers. Model used two properties: dispersity (`<Dispersity>`) and MnGPC (`<MnGPC>`). This is the first generative model for block copolymers. Try the above UI example and see [Park et al. (2022, ChemRxiv)](https://chemrxiv.org/engage/chemrxiv/article-details/62b60865e84dd185e60214af) for details.
 - **Proteins: Stability**: Model pretrained on 2.6M peptides from UniProt with the Boman index as property. Finetuned on the [**Stability**](https://www.science.org/doi/full/10.1126/science.aan0693) dataset from the [TAPE benchmark](https://proceedings.neurips.cc/paper/2019/hash/37f65c068b7723cd7809ee2d31d7861c-Abstract.html) which has ~65k samples.
 
 **Model type**: A Transformer-based language model trained on alphanumeric sequences to simultaneously perform sequence regression or conditional sequence generation.
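The model-card entries describe models conditioned on property tokens such as `<conv>`, `<pdi>`, and `<molwt>`: the Regression Transformer concatenates property values with a (partially masked) sequence and fills in the masks. A small helper sketching how such a query string might be assembled follows; the property tokens come from the model card, but the exact concatenation format the model expects is an assumption here, and `build_query` is a hypothetical name:

```python
def build_query(properties: dict, sequence: str, n_masks: int) -> str:
    """Assemble a property-conditioned, partially masked query string.

    Assumed format (illustrative only): property tokens with their target
    values, a separator, then the sequence followed by mask tokens that
    the model is asked to fill in.
    """
    prop_part = "".join(f"{tok}{val}" for tok, val in properties.items())
    return f"{prop_part}|{sequence}{'[MASK]' * n_masks}"


# Hypothetical ROP-catalyst-style query: constrain conversion and PDI,
# give a monomer in SELFIES, and leave masks for the catalyst to generate.
q = build_query({"<conv>": 0.95, "<pdi>": 1.2}, "[C][C][O]", 3)
print(q)
```

For prediction rather than generation, the same scheme would be inverted: mask the property value and supply the full sequence, letting the model regress the number behind the property token.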