# Model documentation & parameters

**Algorithm Version**: Which model version (either protein-target-driven or gene-expression-profile-driven) to use and which checkpoint to rely on.

**Inference type**: Whether generation is conditioned on the target (`Conditional`, the default) or run in an `Unbiased` manner, without a target.

**Protein target**: The amino acid sequence (AAS) of the protein target used for conditioning. Only used if `Inference type` is `Conditional` and the `Algorithm version` is a Protein model.

**Gene expression target**: A list of 2128 floats representing the embedding of the gene expression profile used for conditioning. Only used if `Inference type` is `Conditional` and the `Algorithm version` is an Omic model.

**Decoding temperature**: The temperature parameter of the SMILES/SELFIES decoder. Higher values lead to more explorative token choices; lower values make sampling greedier and can result in mode collapse, where the model keeps producing very similar molecules.
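The effect of the temperature can be illustrated with a temperature-scaled softmax. The following is a minimal sketch, not the decoder's actual code: the function name `softmax_with_temperature` and the three logits are made up for illustration, and the model's real sampling routine may differ.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw token scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens (illustrative only).
logits = [2.0, 1.0, 0.5]

# Low temperature: almost all probability mass on the top-scoring token.
sharp = softmax_with_temperature(logits, temperature=0.2)

# High temperature: probabilities move toward uniform, so sampling explores more.
flat = softmax_with_temperature(logits, temperature=5.0)
```

With a low temperature, nearly every sample picks the same token at each step, which is one way the generated molecules can become repetitive.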

**Maximal sequence length**: The maximal number of SMILES tokens in the generated molecule.

**Number of samples**: How many samples should be generated (between 1 and 50).
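The constraints above can be checked before a request is sent to the model. The sketch below is a hypothetical helper, not part of the app or of GT4SD: the function `validate_request`, its argument names, and the prefix check on the version string (based on the `Protein_v0` and `Omic_v0` names in the model card) are assumptions made for illustration.

```python
EXPECTED_PROFILE_LENGTH = 2128  # length stated in the parameter description
MAX_SAMPLES = 50                # upper bound stated in the parameter description


def validate_request(algorithm_version, inference_type, number_of_samples,
                     protein_target=None, gene_expression_target=None):
    """Return a list of problems found in a request (an empty list means valid).

    Hypothetical helper: names and checks are illustrative only.
    """
    problems = []

    if not 1 <= number_of_samples <= MAX_SAMPLES:
        problems.append(f"number_of_samples must be between 1 and {MAX_SAMPLES}")

    if inference_type == "Unbiased":
        # This sketch does not inspect targets in unbiased mode.
        return problems
    if inference_type != "Conditional":
        problems.append(f"unknown inference type: {inference_type!r}")
        return problems

    if algorithm_version.startswith("Protein"):
        if not protein_target:
            problems.append("a Protein model needs a protein target sequence")
    elif algorithm_version.startswith("Omic"):
        profile = gene_expression_target or []
        if len(profile) != EXPECTED_PROFILE_LENGTH:
            problems.append(
                f"gene expression target needs {EXPECTED_PROFILE_LENGTH} values, "
                f"got {len(profile)}"
            )
        elif not all(isinstance(v, (int, float)) for v in profile):
            problems.append("gene expression target must contain only numbers")
    else:
        problems.append(f"unknown algorithm version: {algorithm_version!r}")

    return problems
```

For example, `validate_request("Omic_v0", "Conditional", 10, gene_expression_target=[0.0] * 10)` returns a single message about the profile length, while a valid protein request returns an empty list.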


# Model card -- PaccMannRL

**Model Details**: PaccMannRL is a language model for conditional molecular design. It consists of a domain-specific encoder (for protein targets or gene expression profiles) and a generic molecular decoder. Both components are finetuned together using RL to convert the context representation into a molecule with high affinity toward the context (i.e., binding affinity to the protein or high inhibitory effect for the cell profile).

**Developers**: Jannis Born, Matteo Manica and colleagues from IBM Research.

**Distributors**: Original authors' code wrapped and distributed by GT4SD Team (2023) from IBM Research.

**Model date**: Published in 2021.

**Model version**: Models trained and distributed by the original authors.
- **Protein_v0**: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Protein encoder pretrained on 404k proteins from UniProt. Encoder and decoder finetuned on 41 SARS-CoV-2-related protein targets with a binding affinity predictor trained on BindingDB.
- **Omic_v0**: Molecular decoder pretrained on 1.5M molecules from ChEMBL. Gene expression encoder pretrained on 12k gene expression profiles from TCGA. Encoder and decoder finetuned on a few hundred cancer cell profiles from GDSC with an IC50 predictor trained on GDSC.

**Model type**: A language-based molecular generative model that can be optimized with RL to generate molecules with high affinity toward a context.

**Information about training algorithms, parameters, fairness constraints or other applied approaches, and features**: 
- **Protein**: Parameters as provided on [(GitHub repo)](https://github.com/PaccMann/paccmann_sarscov2).
- **Omics**: Parameters as provided on [(GitHub repo)](https://github.com/PaccMann/paccmann_rl).

**Paper or other resource for more information**: 
- **Protein**: [Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2 (2021; *Machine Learning: Science and Technology*)](https://iopscience.iop.org/article/10.1088/2632-2153/abe808/meta).
- **Omics**: [PaccMannRL: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning (2021; *iScience*)](https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6).

**License**: MIT

**Where to send questions or comments about the model**: Open an issue on [GT4SD repository](https://github.com/GT4SD/gt4sd-core).

**Intended Use. Use cases that were envisioned during development**: Chemical research, in particular drug discovery.

**Primary intended uses/users**: Researchers and computational chemists using the model for model comparison or research exploration purposes.

**Out-of-scope use cases**: Production-level inference, producing molecules with harmful properties.

**Factors**: Not applicable.

**Metrics**: High reward on generating molecules with high affinity toward context. 

**Datasets**: ChEMBL, UniProt, GDSC and BindingDB (see above).

**Ethical Considerations**: Unclear, please consult with original authors in case of questions.

**Caveats and Recommendations**: Unclear, please consult with original authors in case of questions.

Model card prototype inspired by [Mitchell et al. (2019)](https://dl.acm.org/doi/abs/10.1145/3287560.3287596).

## Citation

**Omics**:
```bib
@article{born2021paccmannrl,
  title = {PaccMann\textsuperscript{RL}: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning},
  journal = {iScience},
  volume = {24},
  number = {4},
  pages = {102269},
  year = {2021},
  issn = {2589-0042},
  doi = {10.1016/j.isci.2021.102269},
  url = {https://www.cell.com/iscience/fulltext/S2589-0042(21)00237-6},
  author = {Born, Jannis and Manica, Matteo and Oskooei, Ali and Cadow, Joris and Markert, Greta and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a}
}
```

**Proteins**:
```bib
@article{born2021datadriven,
  author = {Born, Jannis and Manica, Matteo and Cadow, Joris and Markert, Greta and Mill, Nil Adell and Filipavicius, Modestas and Janakarajan, Nikita and Cardinale, Antonio and Laino, Teodoro and {Rodr{\'{i}}guez Mart{\'{i}}nez}, Mar{\'{i}}a},
  doi = {10.1088/2632-2153/abe808},
  issn = {2632-2153},
  journal = {Machine Learning: Science and Technology},
  number = {2},
  pages = {025024},
  title = {{Data-driven molecular design for discovery and synthesis of novel ligands: a case study on SARS-CoV-2}},
  url = {https://iopscience.iop.org/article/10.1088/2632-2153/abe808},
  volume = {2},
  year = {2021}
}
```