Model card -- Regression Transformer
Parameters
Algorithm Version:
Which model checkpoint to use (trained on different datasets).
Task
Whether the multitask model should be used for property prediction or conditional generation (default).
Input
The input sequence. In the default setting (where Task is Generate and Sampling Wrapper is True) this can be a seed SMILES (for the molecule models) or an amino-acid sequence (for the protein models). The model will locally adapt the seed sequence by masking a share of its tokens, set by Fraction to mask.
If Task is Predict, the sequences are given as SELFIES for the molecule models. Moreover, the tokens that should be predicted ([MASK] in the input) have to be given explicitly. Load the provided examples to see the expected format.
NOTE: When Task is set to Generate and Sampling Wrapper is set to False, the user has maximal control over the generative process and can explicitly decide which tokens should be masked.
Number of samples
How many samples should be generated (between 1 and 50). If Task is Predict, this has to be set to 1.
Search
Decoding search method. Use Sample if Task is Generate. If Task is Predict, use Greedy.
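The constraints stated so far for Task, Number of samples and Search can be collected in a small check. This is only a sketch: the function name, the argument names and the lower-case string values are hypothetical and are not taken from the application's actual interface.

```python
def check_settings(task, number_of_samples, search):
    """Return a list of problems with the chosen settings (empty if none).

    Hypothetical helper: the names and string values are assumptions made
    for illustration, not the application's real API.
    """
    problems = []
    if task not in ("generate", "predict"):
        problems.append("task must be 'generate' or 'predict'")
    if not 1 <= number_of_samples <= 50:
        problems.append("number of samples must be between 1 and 50")
    if task == "predict":
        # Property prediction yields a single output and uses greedy decoding.
        if number_of_samples != 1:
            problems.append("number of samples must be 1 for predict")
        if search != "greedy":
            problems.append("use greedy search for predict")
    if task == "generate" and search != "sample":
        problems.append("use sample search for generate")
    return problems


print(check_settings("generate", 10, "sample"))
print(check_settings("predict", 5, "sample"))
```

The first call reports no problems; the second reports both the sample count and the search method.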
Tolerance
Precision tolerance; only used if Task is Generate. This is a single float between 0 and 100 for the tolerated deviation between the desired (primed) property and the predicted property of the generated molecule. It is given as a percentage of the property range encountered during training.
NOTE: The tolerance is only used for post-hoc filtering of the generated samples.
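As a worked illustration of the tolerance, the post-hoc filter can be read as a comparison against a share of the training property range. The sketch below assumes a hypothetical property whose training range is 0 to 1; the function and argument names are illustrative and not part of the model's code.

```python
def within_tolerance(target, predicted, prop_min, prop_max, tolerance):
    """Check whether a predicted property is close enough to the target.

    `tolerance` is a percentage (0 to 100) of the property range
    [prop_min, prop_max] seen during training. Hypothetical helper.
    """
    if not 0 <= tolerance <= 100:
        raise ValueError("tolerance must be between 0 and 100")
    allowed_deviation = (tolerance / 100.0) * (prop_max - prop_min)
    return abs(predicted - target) <= allowed_deviation


# Assumed range 0..1, target 0.8, tolerance 10 % -> allowed deviation 0.1.
predictions = [0.75, 0.95, 0.82]
kept = [p for p in predictions if within_tolerance(0.8, p, 0.0, 1.0, 10)]
print(kept)
```

With these assumed numbers, the sample predicted at 0.95 is discarded because it deviates by 0.15, which exceeds the allowed 0.1.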
Sampling Wrapper
Only used if Task is Generate. If set to False, the user has to provide a full RT-sequence as Input and has to explicitly decide which tokens are masked (see example below). This gives full control but is tedious. Instead, if Sampling Wrapper is set to True, the RT stochastically determines which parts of the sequence are masked.
NOTE: All arguments below only apply if Sampling Wrapper is True.
Fraction to mask
Specifies the fraction of tokens that can be changed by the model. Argument only applies if Task is Generate and Sampling Wrapper is True.
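To make the idea of masking a fraction of the tokens concrete, here is a minimal sketch. It assumes that exactly the requested fraction of positions is chosen uniformly at random, which may differ from how the RT actually samples; the helper name and its seed argument are hypothetical.

```python
import random


def mask_fraction(tokens, fraction, seed=None):
    """Replace a random share of tokens with "[MASK]".

    Assumption: positions are drawn uniformly at random. This illustrates
    the idea only and is not the model's own sampling procedure.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_mask = round(fraction * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]


seed_tokens = ["[C]", "[C]", "[O]", "[C]", "[N]", "[C]", "[C]", "[O]", "[C]", "[C]"]
masked = mask_fraction(seed_tokens, 0.2, seed=0)
print(masked.count("[MASK]"))
```

With ten tokens and a fraction of 0.2, two positions are masked; which two depends on the seed.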
Property goal
Specifies the desired target properties for the generation. These must be given in the format <prop>:value. If the model supports multiple properties, separate them with a comma (,). Argument only applies if Task is Generate and Sampling Wrapper is True.
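The <prop>:value format with comma-separated entries can be parsed as sketched below. The parser and the property names in the example (<prop_a>, <prop_b>) are hypothetical placeholders; the properties actually available depend on the chosen model.

```python
def parse_property_goal(text):
    """Parse a string like "<prop_a>:0.5, <prop_b>:-1.2" into a dict.

    Hypothetical helper for illustration; not part of the model's code.
    """
    goals = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.rpartition(":")
        if not sep or not (name.startswith("<") and name.endswith(">")):
            raise ValueError(f"expected '<prop>:value', got {item!r}")
        goals[name] = float(value)  # raises ValueError for non-numeric values
    return goals


goals = parse_property_goal("<prop_a>:0.5, <prop_b>:-1.2")
print(goals)
```

Malformed entries such as a missing colon or a non-numeric value raise a ValueError instead of being silently ignored.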
Tokens to mask
Optionally specifies which tokens (atoms, bonds, etc.) can be masked. Please separate multiple tokens by a comma (,). If not specified, all tokens can be masked. Argument only applies if Task is Generate and Sampling Wrapper is True.
Substructures to mask
Optionally specifies a list of substructures that should always be masked (i.e., they are masked deterministically rather than left to the stochastic masking). Given in SMILES format. If multiple are provided, separate them by a comma (,). Argument only applies if Task is Generate and Sampling Wrapper is True.
NOTE: Most models operate on SELFIES and the matching of the substructures occurs in SELFIES simply on a string level.
Substructures to keep
Optionally specifies a list of substructures that should definitely be present in the target sample (i.e., excluded from stochastic masking). Given in SMILES format. Argument only applies if Task is Generate and Sampling Wrapper is True.
NOTE: This keeps tokens even if they are included in tokens_to_mask.
NOTE: Most models operate on SELFIES and the matching of the substructures occurs in SELFIES simply on a string level.
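The interplay between Tokens to mask and Substructures to keep, together with the string-level matching on SELFIES, can be sketched as follows. The code assumes the substructures have already been converted from SMILES to SELFIES (for example, SMILES CO becomes [C][O]); the helper names are hypothetical and the RT's actual matching may differ in detail.

```python
import re

# A SELFIES string is a sequence of bracketed tokens such as "[C][C][O]".
TOKEN = re.compile(r"\[[^\]]*\]")


def protected_positions(selfies, keep_selfies):
    """Token positions covered by a substructure, matched on the string level."""
    tokens = TOKEN.findall(selfies)
    keep = TOKEN.findall(keep_selfies)
    protected = set()
    if not keep:
        return protected
    for start in range(len(tokens) - len(keep) + 1):
        if tokens[start:start + len(keep)] == keep:
            protected.update(range(start, start + len(keep)))
    return protected


def maskable_positions(selfies, tokens_to_mask=None, keep_selfies=()):
    """Positions that may be masked; kept substructures take precedence."""
    tokens = TOKEN.findall(selfies)
    protected = set()
    for sub in keep_selfies:
        protected |= protected_positions(selfies, sub)
    return [
        i for i, tok in enumerate(tokens)
        if (not tokens_to_mask or tok in tokens_to_mask) and i not in protected
    ]


# "[C]" may be masked in general, but the "[C][O]" substructure is kept.
print(maskable_positions("[C][C][O][C]", {"[C]"}, ["[C][O]"]))
```

In this example the carbon at position 1 stays unmasked even though [C] is listed as maskable, because it belongs to the kept substructure. Since the match is purely textual, a substructure whose SELFIES encoding differs inside the larger molecule (for instance due to ring or branch tokens) would not be found.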
Citation
@article{born2022regression,
title={Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens},
author={Born, Jannis and Manica, Matteo},
journal={arXiv preprint arXiv:2202.01338},
note={Spotlight talk at ICLR workshop on Machine Learning for Drug Discovery},
year={2022}
}