license: cc-by-nc-nd-4.0
datasets:
- collaiborateorg/ChEMBL-MolGen-V1
Molecule Generation Model Using ChEMBL
Unlock the power of AI-driven molecule generation! For access to our model, reach out to our support team at [email protected] and start exploring limitless possibilities.
Model Overview
This model is a molecular generation model fine-tuned on our custom ChEMBL-MolGen-v1 dataset. It is designed to generate novel molecules with specific molecular properties such as molecular weight, LogP, synthetic accessibility, and the inclusion or exclusion of particular functional groups. The model is also capable of incorporating toxicity and stability constraints to generate molecules that are both safe and chemically stable.
Model Details
- Model Name: Bio-ChEMBL-MolGen-Llama-3-2-1B-V1
- Model Type: Transformer-based architecture, fine-tuned for molecular generation tasks.
- Pretrained Model: The base model was fine-tuned on our custom ChEMBL-MolGen-v1 dataset, a collection of bioactive molecules used in drug discovery.
- Task: Molecule generation with constraints on:
  - Molecular weight
  - LogP value (lipophilicity)
  - Synthetic accessibility
  - Functional groups
  - Toxicity and stability (including the exclusion of toxic or reactive groups)
Intended Use
This model is intended for chemoinformatics and drug discovery applications, including:
- Designing novel molecules with desired physicochemical properties (e.g., molecular weight, LogP).
- Optimizing molecules for drug-like properties by excluding undesirable functional groups and minimizing toxicity.
- Generating molecules with consideration for environmental impact and biodegradability.
- Predicting the stability of generated molecules in different conditions.
It can be used for tasks such as:
- Virtual screening
- Drug design and optimization
- Property prediction for lead compounds
- Molecule generation with specific functional or structural constraints
Model Features
- Molecular Property Control: Generate molecules with targeted molecular weight, LogP, and other desired properties.
- Functional Group Management: Incorporate specific functional groups and exclude those linked to toxicity (e.g., nitriles, azides).
- Toxicity & Safety Constraints: Generate molecules that avoid known toxic, reactive, or carcinogenic groups.
- Synthetic Accessibility Prediction: Design molecules with favorable synthetic feasibility.
How to Use
Requirements:
- Python >= 3.7
- Hugging Face transformers library
Example Usage:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "collaiborateorg/Bio-ChEMBL-MolGen-Llama-3-2-1B-V1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reuse the BOS token for padding and pad on the left, as is usual for generation.
tokenizer.pad_token = tokenizer.bos_token
tokenizer.pad_token_id = tokenizer.bos_token_id
tokenizer.padding_side = 'left'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

def stream(user_prompt):
    system_prompt = 'The conversation between Human and AI assistant named Collaiborator\n'
    # Chat layout: system turn, user turn, then an open assistant turn.
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt.strip()}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_prompt.strip()}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
    # Send the inputs to whichever device the model was placed on.
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    # Print tokens as they are generated, hiding the prompt and special tokens.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=1000)

stream("Generate a molecule with a molecular weight around 400.2, LogP of approximately 2.8, TPSA less than 90, and no more than 2 hydrogen bond donors")
Output:
- **Generated Molecule**:
- SMILES: O=C(NCc1ccccc1)N1CCC[C@H]1C(=O)Nc1cccc(C(F)(F)F)c1
- **Property Comparison**:
- **Molecular Weight**: 400.2
- **LogP**: 2.8
- **TPSA**: 90.3
- **H-bond Acceptors**: 5
- **H-bond Donors**: 2
- **Rotatable Bonds**: 4
- **Ring Count**: 3
- **QED Score**: 0.674
**Final Molecule**:
The generated molecule, while not perfect, should have a good balance of pharmacological properties. It has a molecular weight around 400.2, a LogP within 0.8-1.2, TPSA slightly above 90.3, 3 hydrogen bond acceptors, 2 hydrogen bond donors, 4 rotatable bonds, and a high QED score.
Input Constraints:
- Molecular weight (range)
- LogP value (range)
- Functional groups (inclusion or exclusion)
- Toxicity constraints (avoidance of specific reactive or harmful groups)
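The constraints listed above are passed to the model as plain English inside the user prompt. The sketch below shows one possible way to assemble such a prompt from structured values. The helper names `format_constraint` and `build_user_prompt` and the constraint keywords (`around`, `max`, `min`, `range`) are hypothetical and not part of this model's API, and the wording only mirrors the example prompt in this card, so other phrasings may work better or worse.

```python
def format_constraint(name, spec):
    """Render one (kind, value) constraint as a short English phrase."""
    kind, value = spec
    if kind == "around":
        return f"{name} around {value}"
    if kind == "max":
        return f"{name} no more than {value}"
    if kind == "min":
        return f"{name} at least {value}"
    if kind == "range":
        low, high = value
        return f"{name} between {low} and {high}"
    raise ValueError(f"unknown constraint kind: {kind!r}")


def build_user_prompt(constraints, exclude_groups=()):
    """Join constraint phrases into a single request sentence (hypothetical helper)."""
    phrases = [format_constraint(name, spec) for name, spec in constraints.items()]
    if exclude_groups:
        phrases.append("no " + " or ".join(exclude_groups) + " groups")
    if not phrases:
        raise ValueError("at least one constraint is required")
    if len(phrases) <= 2:
        body = " and ".join(phrases)
    else:
        body = ", ".join(phrases[:-1]) + ", and " + phrases[-1]
    return "Generate a molecule with " + body


prompt = build_user_prompt(
    {
        "a molecular weight": ("range", (350, 450)),
        "LogP": ("around", 2.8),
        "TPSA": ("max", 90),
    }
)
print(prompt)
```

The resulting string can be passed to the `stream` function from the usage example in place of the hand-written prompt.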
Output:
- SMILES notation of the generated molecule
- Molecule properties (optional): LogP, molecular weight, and other key descriptors.
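Because the model returns free text, downstream code usually needs to pull the SMILES string and the reported descriptors out of the reply. The following is a minimal sketch that assumes the reply follows the layout of the sample output in this card (a `SMILES:` line and `**Name**: value` property lines). The function name `parse_generation` is hypothetical, and real replies may deviate from this layout.

```python
import re

# "SMILES: <string>" with optional markdown bold markers around the label.
SMILES_RE = re.compile(r"SMILES\**\s*:\s*(\S+)")
# "**Property Name**: 12.3" on a single line.
PROPERTY_RE = re.compile(r"\*\*(?P<name>[^*]+)\*\*:[ \t]*(?P<value>-?\d+(?:\.\d+)?)")


def parse_generation(text):
    """Return (smiles or None, {property name: float}) parsed from a model reply."""
    smiles_match = SMILES_RE.search(text)
    properties = {
        m.group("name").strip(): float(m.group("value"))
        for m in PROPERTY_RE.finditer(text)
    }
    return (smiles_match.group(1) if smiles_match else None), properties


# Sample reply copied from the output shown in this card.
sample = """- **Generated Molecule**:
- SMILES: O=C(NCc1ccccc1)N1CCC[C@H]1C(=O)Nc1cccc(C(F)(F)F)c1
- **Property Comparison**:
- **Molecular Weight**: 400.2
- **LogP**: 2.8
- **TPSA**: 90.3
- **H-bond Acceptors**: 5
- **H-bond Donors**: 2
- **Rotatable Bonds**: 4
- **Ring Count**: 3
- **QED Score**: 0.674
"""

smiles, props = parse_generation(sample)
print(smiles)
print(props["TPSA"])
```

The parsed values are the ones the model reports about its own output. Recomputing them with a cheminformatics toolkit before relying on them is a sensible extra step.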
Training Data
The model was fine-tuned on our custom ChEMBL-MolGen-v1, a large dataset of bioactive molecules and their bioactivity data. ChEMBL provides extensive information on molecular properties, activity data, and detailed chemical structures, enabling the model to learn the relationship between molecular structure and biological activity.
Evaluation Metrics
The model was evaluated based on the following metrics:
- Molecular Property Prediction Accuracy: How well the generated molecules adhere to the target properties (e.g., molecular weight, LogP).
- Functional Group Control: The model’s ability to generate molecules with the requested functional groups and avoid toxic ones.
- Synthetic Accessibility: The predicted ease of synthesis for generated molecules, often measured through synthetic accessibility scores.
- Toxicity and Stability: Evaluating the model's ability to avoid problematic functional groups and generate stable molecules.
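This card does not state how property adherence was computed. One plausible way to quantify it is sketched below: the fraction of generated molecules whose measured property falls within a relative tolerance of the target, plus the mean absolute error. The function names, the 5% default tolerance, and the numbers in the example are illustrative assumptions and not results from this model.

```python
def within_tolerance(target, actual, rel_tol=0.05):
    """True if actual is within rel_tol (as a fraction) of target."""
    return abs(actual - target) <= rel_tol * abs(target)


def adherence_rate(pairs, rel_tol=0.05):
    """Fraction of (target, actual) pairs that fall within the tolerance."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no (target, actual) pairs given")
    hits = sum(within_tolerance(t, a, rel_tol) for t, a in pairs)
    return hits / len(pairs)


def mean_absolute_error(pairs):
    """Average absolute gap between target and actual values."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no (target, actual) pairs given")
    return sum(abs(a - t) for t, a in pairs) / len(pairs)


# Made-up (target, actual) molecular weights, for illustration only.
example_pairs = [(400.0, 391.4), (400.0, 452.0), (300.0, 310.0)]
rate = adherence_rate(example_pairs)
mae = mean_absolute_error(example_pairs)
print(f"adherence: {rate:.2f}, MAE: {mae:.1f}")
```

The `actual` values should come from an independent descriptor calculation on the generated SMILES and not from the properties the model reports in its reply.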
Limitations
- Synthetic Feasibility: While the model generates novel molecules, the feasibility of synthesizing these molecules may vary and requires further validation through experimental chemistry.
- Property Approximation: While the model targets specific molecular properties, it is not always guaranteed to meet these targets exactly.
- Bias in Training Data: The model is fine-tuned on ChEMBL, which may introduce biases based on the dataset's chemical space.
License
This model is released under the CC BY-NC-ND 4.0 license, which permits non-commercial use only. For commercial usage, please contact the model author for licensing details.
Citation
If you use this model in your research or application, please cite it as follows:
@misc{Collaiborate_Bio-ChEMBL-MolGen-Llama-3-2-1B-V1,
  author = {ContactDoctor},
  title = {Bio-ChEMBL-MolGen: A High-Performance Biomedical Language Model for Molecule Generation Using ChEMBL},
  year = {2024},
  howpublished = {https://huggingface.co/collaiborateorg/Bio-ChEMBL-MolGen-Llama-3-2-1B-V1},
}
Acknowledgements
This model is built on top of Hugging Face's transformers library and was fine-tuned using the ChEMBL dataset. We thank the contributors of ChEMBL for providing the data and Hugging Face for their powerful model framework.