---
license: mit
metrics:
- accuracy
tags:
- chemistry
---
# Molecular BERT Pretrained Using ChEMBL Database

This model was pretrained following the methodology outlined in the paper [Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration](https://spj.science.org/doi/10.34133/research.0004). The original model was trained with custom code; in this project it has been adapted to the Hugging Face Transformers framework, so it can be used with the standard `transformers` API, as sketched below.
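Because the checkpoint follows the standard Transformers layout, it can be loaded with the `Auto*` classes. The repository id below is a placeholder (this card does not state it), and the masked-language-modeling head is an assumption based on the pretraining objective:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder repository id -- substitute the actual Hub id of this model.
repo_id = "your-username/molecular-bert-chembl"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Encode a SMILES string (aspirin) and run a forward pass.
inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)
```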
## Model Details
The model architecture is based on BERT. The key configuration details are:

```python
BertConfig(
    vocab_size=70,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len,
    type_vocab_size=1,
    pad_token_id=tokenizer_pretrained.vocab["[PAD]"],
    position_embedding_type="absolute"
)
```
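Note that the configuration above references `max_seq_len` and `tokenizer_pretrained`, which are not defined in this card. A minimal sketch of how it might be instantiated, assuming a maximum sequence length of 128, a SMILES tokenizer saved in Transformers format at a hypothetical local path, and a masked-language-modeling pretraining head:

```python
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

# Assumed values -- neither is specified in this card.
max_seq_len = 128
tokenizer_pretrained = AutoTokenizer.from_pretrained("path/to/smiles-tokenizer")

config = BertConfig(
    vocab_size=70,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len,
    type_vocab_size=1,
    pad_token_id=tokenizer_pretrained.vocab["[PAD]"],
    position_embedding_type="absolute",
)

# Pair the config with an MLM head for pretraining (assumed objective).
model = BertForMaskedLM(config)
```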
- Optimizer: AdamW
- Learning rate: 1e-4
- Learning rate scheduler: none
- Epochs: 50
- AMP: enabled (see the training-loop sketch below)
- GPU: single NVIDIA RTX 3090
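For orientation, a rough sketch of a pretraining loop that combines these settings (AdamW at 1e-4, no scheduler, 50 epochs, mixed precision). Here `model` is the `BertForMaskedLM` from the sketch above and `train_dataloader` is a hypothetical dataloader yielding tokenized SMILES batches with MLM labels (e.g. built with `DataCollatorForLanguageModeling`); this is not the authors' exact training code:

```python
import torch
from torch.optim import AdamW

device = torch.device("cuda")
model.to(device)  # `model` from the configuration sketch above

optimizer = AdamW(model.parameters(), lr=1e-4)  # no LR scheduler, per the card
scaler = torch.cuda.amp.GradScaler()            # AMP enabled

for epoch in range(50):
    for batch in train_dataloader:  # hypothetical dataloader with MLM labels
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(**batch).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```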
## Pretraining Database
The model was pretrained using data from the ChEMBL database, specifically version 33. You can download the database from [ChEMBL](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/).
Additionally, the dataset is available on the Hugging Face Datasets Hub and can be accessed at [Hugging Face Datasets - ChEMBL_v33_pretraining](https://huggingface.co/datasets/jonghyunlee/ChEMBL_v33_pretraining/viewer/default/train).
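The Hub copy can be pulled with the `datasets` library. The snippet below assumes only the dataset id shown in the link above; column names are not documented here, so inspect the features before use:

```python
from datasets import load_dataset

# Load the ChEMBL v33 pretraining corpus from the Hugging Face Hub.
dataset = load_dataset("jonghyunlee/ChEMBL_v33_pretraining")

print(dataset)              # available splits and column names
print(dataset["train"][0])  # first record of the train split
```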
## Performance
The pretrained model achieves an accuracy of 0.9672 on a held-out test set comprising 10% of the ChEMBL dataset.