---
license: mit
metrics:
- accuracy
tags:
- chemistry
---
# Molecular BERT Pretrained Using ChEMBL Database

This model was pretrained following the methodology outlined in the paper [Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration](https://spj.science.org/doi/10.34133/research.0004). The original model was trained with custom code; in this project it has been adapted to the Hugging Face Transformers framework, so it can be used with the standard `transformers` API, as sketched below.
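Because the checkpoint follows the standard Transformers layout, it can be loaded with the `Auto*` classes. The repository id below is a placeholder (this card does not state it), and the masked-language-modeling head is an assumption based on the pretraining objective:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder repository id -- substitute the actual Hub id of this model.
repo_id = "your-username/molecular-bert-chembl"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Encode a SMILES string (aspirin) and run a forward pass.
inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)
```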
## Model Details
The model architecture is based on BERT. The key configuration details are:

```python
BertConfig(
    vocab_size=70,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len,
    type_vocab_size=1,
    pad_token_id=tokenizer_pretrained.vocab["[PAD]"],
    position_embedding_type="absolute"
)
```
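Note that the configuration above references `max_seq_len` and `tokenizer_pretrained`, which are not defined in this card. A minimal sketch of how it might be instantiated, assuming a maximum sequence length of 128, a SMILES tokenizer saved in Transformers format at a hypothetical local path, and a masked-language-modeling pretraining head:

```python
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

# Assumed values -- neither is specified in this card.
max_seq_len = 128
tokenizer_pretrained = AutoTokenizer.from_pretrained("path/to/smiles-tokenizer")

config = BertConfig(
    vocab_size=70,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len,
    type_vocab_size=1,
    pad_token_id=tokenizer_pretrained.vocab["[PAD]"],
    position_embedding_type="absolute",
)

# Pair the config with an MLM head for pretraining (assumed objective).
model = BertForMaskedLM(config)
```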
- Optimizer: AdamW
- Learning rate: 1e-4
- Learning rate scheduler: none
- Epochs: 50
- AMP: enabled (see the training-loop sketch below)
- GPU: single NVIDIA RTX 3090
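For orientation, a rough sketch of a pretraining loop that combines these settings (AdamW at 1e-4, no scheduler, 50 epochs, mixed precision). Here `model` is the `BertForMaskedLM` from the sketch above and `train_dataloader` is a hypothetical dataloader yielding tokenized SMILES batches with MLM labels (e.g. built with `DataCollatorForLanguageModeling`); this is not the authors' exact training code:

```python
import torch
from torch.optim import AdamW

device = torch.device("cuda")
model.to(device)  # `model` from the configuration sketch above

optimizer = AdamW(model.parameters(), lr=1e-4)  # no LR scheduler, per the card
scaler = torch.cuda.amp.GradScaler()            # AMP enabled

for epoch in range(50):
    for batch in train_dataloader:  # hypothetical dataloader with MLM labels
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(**batch).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```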
## Pretraining Database
The model was pretrained using data from the ChEMBL database, specifically version 33. You can download the database from [ChEMBL](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/).
Additionally, the dataset is available on the Hugging Face Datasets Hub and can be accessed at [Hugging Face Datasets - ChEMBL_v33_pretraining](https://huggingface.co/datasets/jonghyunlee/ChEMBL_v33_pretraining/viewer/default/train).
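The Hub copy can be pulled with the `datasets` library. The snippet below assumes only the dataset id shown in the link above; column names are not documented here, so inspect the features before use:

```python
from datasets import load_dataset

# Load the ChEMBL v33 pretraining corpus from the Hugging Face Hub.
dataset = load_dataset("jonghyunlee/ChEMBL_v33_pretraining")

print(dataset)              # available splits and column names
print(dataset["train"][0])  # first record of the train split
```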
## Performance
The pretrained model achieves an accuracy of 0.9672 on a held-out test set comprising 10% of the ChEMBL dataset.