
Model documentation & parameters

Algorithm Version: Which algorithm to use (VAE, AAE or ORGAN).

Sequence length: Maximum length of the generated sequence.

Number of samples: How many samples should be generated (between 1 and 50).
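As a minimal sketch of how these three parameters fit together, the snippet below captures and validates the documented constraints. The class and field names are illustrative only; they are not part of the actual GT4SD API.

```python
from dataclasses import dataclass

# Hypothetical container for the documented parameters; names are
# illustrative and NOT the real GT4SD interface.
@dataclass
class MosesGenerationConfig:
    algorithm_version: str  # "VAE", "AAE", or "ORGAN"
    sequence_length: int    # maximum length of the generated sequence
    number_of_samples: int  # between 1 and 50, per the documentation

    def __post_init__(self):
        if self.algorithm_version not in {"VAE", "AAE", "ORGAN"}:
            raise ValueError(f"unknown algorithm: {self.algorithm_version}")
        if not 1 <= self.number_of_samples <= 50:
            raise ValueError("number of samples must be between 1 and 50")
        if self.sequence_length < 1:
            raise ValueError("sequence length must be positive")

config = MosesGenerationConfig("VAE", sequence_length=100, number_of_samples=10)
```

A request for 51 samples, for instance, would raise a `ValueError` under the documented 1-50 limit.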

Model card - MOSES

Model Details: MOSES is a benchmarking platform for molecular generation models.

Developers: Daniil Polykovskiy and colleagues from Insilico Medicine.

Distributors: Original authors' code wrapped and distributed by the GT4SD Team (2023) from IBM Research.

Model date: Released in 2020.

Model version: Model provided by the original authors, adapted by the GT4SD authors in the GT4SD GitHub repository.

Model type: Three types of generative models:

  1. VAE: The variational autoencoder (VAE) [1,2,3] (https://arxiv.org/abs/1711.07839) is a framework for training two neural networks, an encoder and a decoder, to learn a mapping from a high-dimensional data representation into a lower-dimensional space and back. The lower-dimensional space is called the latent space, often a continuous vector space with a normally distributed latent representation. VAE parameters are optimized to encode and decode data by minimizing the reconstruction loss while also minimizing a KL-divergence term arising from the variational approximation, which can loosely be interpreted as a regularization term. Since molecules are discrete objects, a properly trained VAE defines an invertible continuous representation of a molecule. MOSES combines aspects from both implementations: the encoder is a bidirectional Gated Recurrent Unit (GRU) with a linear output layer, and the decoder is a 3-layer GRU RNN with 512 hidden dimensions and intermediate dropout layers with dropout probability 0.2. Training is done with a batch size of 128, gradient clipping at 50, a KL-term weight of 1, and the Adam optimizer with a learning rate of 0.0003 for 50 epochs.
  2. AAE: The adversarial autoencoder (AAE) [1] combines the idea of the VAE with adversarial training as found in GANs. One of the main drawbacks of the VAE is that its KL-divergence term has a closed-form analytical solution only for a handful of distributions. In the AAE, the KL-divergence term is avoided by training a discriminator network to predict whether a given sample came from the latent space of the autoencoder or from a prior distribution. Parameters are optimized to minimize both the reconstruction loss and the discriminator loss. Kadurin et al. applied the AAE architecture to the drug generation task. The model consists of an encoder with a 1-layer bidirectional LSTM with 380 hidden dimensions, a decoder with a 2-layer LSTM with 640 hidden dimensions, and a shared embedding of size 32. The latent space has dimension 640, and the discriminator network is a 2-layer fully connected neural network with 640 and 256 nodes respectively, using the ELU activation function. Training is done with a batch size of 128 and the Adam optimizer with a learning rate of 0.001 for 25 epochs.
  3. ORGAN: The objective-reinforced generative adversarial network (ORGAN) (see [1,2]) is a sequence generation model based on adversarial training that aims at generating discrete sequences that emulate a data distribution while biasing the generation process towards desired objective rewards using reinforcement learning. ORGAN incorporates two networks: a generator and a discriminator. The goal of the generator is to create synthetic data examples that are indistinguishable from the empirical data distribution; the discriminator learns to distinguish synthetic from real data samples. Both models are trained in alternation. To properly train a GAN, the gradient must be backpropagated between the generator and discriminator, which is not possible when the data samples come from a discrete distribution such as the multinomial, since discrete distributions are non-differentiable. SeqGAN proposed to learn a policy gradient that can be backpropagated, calculated with the REINFORCE algorithm. ORGAN extended this framework to include other reward functions besides the discriminator. Reinforcement is done with an N-depth Monte Carlo tree search, and the reward is a weighted sum of probabilities from the discriminator and the objective reward. The generator and discriminator are pre-trained for 250 and 50 epochs respectively, then jointly trained for 100 epochs with the Adam optimizer at a learning rate of 0.0001. In the experiments, chemical validity and uniqueness were used as rewards.
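The KL-divergence term discussed above is central to the VAE/AAE distinction: for a diagonal Gaussian posterior against a standard normal prior it has a simple closed form, which the VAE minimizes directly and the AAE replaces with a discriminator. A minimal sketch of that closed form (function name and parameterization via log-variance are illustrative conventions, not taken from the MOSES code):

```python
import math

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.

    This is the regularization term the VAE adds to its reconstruction
    loss; the AAE avoids it because such closed forms exist only for a
    handful of distributions.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )

# A latent code whose posterior matches the prior exactly incurs no penalty.
kl_zero = gaussian_kl([0.0, 0.0], [0.0, 0.0])  # → 0.0
# Means shifted away from the prior are penalized.
kl_off = gaussian_kl([1.0, -1.0], [0.0, 0.0])  # → 1.0
```

In the VAE training described above, this term is weighted (KL-term weight of 1) and added to the reconstruction loss before the Adam update.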

Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: Trained by the original authors with the default parameters provided in their paper.

Paper or other resource for more information: Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models (2020; Frontiers in Pharmacology).

License: MIT

Where to send questions or comments about the model: Open an issue on the GT4SD repo.

Intended Use. Use cases that were envisioned during development: Chemical research, in particular drug discovery.

Primary intended uses/users: Researchers and computational chemists using the model for model comparison or research exploration purposes.

Out-of-scope use cases: Production-level inference, producing molecules with harmful properties.

Factors: Not applicable.

Metrics: Validation loss on decoding valid molecules; the models are further evaluated on several downstream benchmarking tasks.
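Two of the benchmark quantities mentioned in this card (uniqueness, used as an ORGAN reward, and novelty against the training set) can be sketched as set operations over generated molecules. MOSES computes these on canonical SMILES (via RDKit); here raw string equality stands in for canonicalization, as a simplification:

```python
def uniqueness(generated):
    """Fraction of generated molecules that are distinct.

    Simplified: compares raw SMILES strings; the real benchmark
    canonicalizes molecules with RDKit first.
    """
    return len(set(generated)) / len(generated) if generated else 0.0

def novelty(generated, training_set):
    """Fraction of distinct generated molecules absent from the training data."""
    distinct = set(generated)
    if not distinct:
        return 0.0
    return len(distinct - set(training_set)) / len(distinct)

samples = ["CCO", "CCO", "c1ccccc1", "CCN"]
train = {"CCO"}
u = uniqueness(samples)      # 3 distinct out of 4 → 0.75
n = novelty(samples, train)  # 2 of the 3 distinct molecules are novel
```

Validity, the other ORGAN reward, additionally requires a chemistry toolkit to check that each string parses to a real molecule, so it is omitted from this sketch.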

Datasets: 4M molecules from ZINC.

Ethical Considerations: Unclear, please consult with original authors in case of questions.

Caveats and Recommendations: Unclear, please consult with original authors in case of questions.

Model card prototype inspired by Mitchell et al. (2019)

Citation

@article{polykovskiy2020molecular,
  title={Molecular sets (MOSES): a benchmarking platform for molecular generation models},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and others},
  journal={Frontiers in pharmacology},
  volume={11},
  pages={565644},
  year={2020},
  publisher={Frontiers Media SA}
}