metadata
library_name: transformers
tags:
  - chemistry
  - biology
  - cheminformatics
  - materials science
license: mit
language:
  - en
metrics:
  - mse
  - r_squared
base_model:
  - seyonec/ChemBERTa-zinc-base-v1

ChemSolubilityBERTa

Model Description

ChemSolubilityBERTa is a prototype model that predicts the aqueous solubility of chemical compounds from their SMILES representations. It is based on ChemBERTa, a BERT-like transformer architecture pre-trained on 77M SMILES strings for molecular property prediction. We adapted ChemBERTa to solubility prediction by fine-tuning it on ESOL (Estimated SOLubility), a water solubility dataset of 1,128 samples. Given a SMILES string as input, the model outputs a log solubility value (log mol/L). You can read the full paper here.
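Because the output is on a logarithmic scale, it can help to convert a prediction back to a concentration. The sketch below assumes the logarithm is base 10, which is the ESOL convention; the function names and the example value are illustrative and not part of the model's API.

```python
def log_s_to_molar(log_s: float) -> float:
    """Convert log10 solubility (log mol/L) to a concentration in mol/L."""
    return 10.0 ** log_s


def molar_to_grams_per_litre(molar: float, molar_mass: float) -> float:
    """Convert mol/L to g/L given the compound's molar mass in g/mol."""
    return molar * molar_mass


# Illustrative value only, NOT an actual model prediction.
example_log_s = -2.0
example_molar = log_s_to_molar(example_log_s)
print(f"{example_log_s} log mol/L = {example_molar:.4f} mol/L")
```

A difference of one unit in the predicted value therefore corresponds to a tenfold change in solubility.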

Fine-Tuning Details

  • Pretrained model: seyonec/ChemBERTa-zinc-base-v1
  • Dataset: ESOL (delaney-processed)
  • Task: Aqueous solubility prediction (log mol/L)
  • Number of training epochs: 3
  • Batch size: 16
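As a rough worked example of what these settings imply, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The card does not state the train/validation split, so the sketch below assumes all 1,128 ESOL samples are used for training (an upper bound) and no gradient accumulation; the `FineTuneConfig` name is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters listed on this card (the container itself is hypothetical)."""
    pretrained_model: str = "seyonec/ChemBERTa-zinc-base-v1"
    num_epochs: int = 3
    batch_size: int = 16


def steps_per_epoch(n_train: int, batch_size: int) -> int:
    """Batches per epoch; the last batch may be smaller than batch_size."""
    return math.ceil(n_train / batch_size)


config = FineTuneConfig()
n_train = 1128  # ASSUMPTION: the full ESOL dataset is used for training
per_epoch = steps_per_epoch(n_train, config.batch_size)  # ceil(1128 / 16) = 71
total = per_epoch * config.num_epochs                    # 71 * 3 = 213
print(f"{per_epoch} steps per epoch, {total} steps in total")
```

With a held-out validation set the training subset is smaller, and the step count drops accordingly.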

How to Use

You can use the model to predict solubility for any molecule represented by a SMILES string:

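import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Repository ID taken from the citation URL in this README
model_id = "khanfs/ChemSolubilityBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

smiles_string = "CCO"  # Example for ethanol
inputs = tokenizer(smiles_string, return_tensors="pt")
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)

# The model returns a single logit: the predicted log solubility (log mol/L)
solubility = outputs.logits.item()
print(f"Predicted solubility: {solubility:.3f} log mol/L")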
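The card's metadata lists MSE and R² as metrics. To check a set of predictions against measured values, both can be computed in a few lines. This is a minimal pure-Python sketch; the function names and the sample numbers are illustrative placeholders, not results from this model.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    return 1.0 - ss_res / ss_tot


# Placeholder values in log mol/L, NOT real measurements or model outputs.
measured = [-1.0, -2.0, -3.0]
predicted = [-1.5, -2.0, -2.5]
print(f"MSE: {mean_squared_error(measured, predicted):.3f}")
print(f"R^2: {r_squared(measured, predicted):.3f}")
```

Lower MSE and an R² closer to 1 indicate predictions that track the measured solubilities more closely.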

Citation and Usage

If you use ChemSolubilityBERTa in your research or projects, please cite the following:

@misc{ChemSolubilityBERTa,
  author = {Farooq Khan},
  title = {ChemSolubilityBERTa: A Transformer-Based Model for Predicting Aqueous Solubility from SMILES},
  year = {2024},
  url = {https://huggingface.co/khanfs/ChemSolubilityBERTa}
}

License

This model is licensed under the MIT License.