metadata

license: cc-by-nc-nd-4.0
datasets:
  - collaiborateorg/ChEMBL-MolGen-V1

Molecule Generation Model Using ChEMBL

For access to the model, contact our support team at [email protected].

Model Overview

This is a molecular generation model fine-tuned on our custom ChEMBL-MolGen-V1 dataset. It generates novel molecules that target specific properties such as molecular weight, LogP, and synthetic accessibility, and it can include or exclude particular functional groups. It also accepts toxicity and stability constraints, steering generation away from toxic, reactive, or chemically unstable structures.

Model Details

Model Name: Bio-ChEMBL-MolGen-Llama-3-2-1B-V1
Model Type: Transformer-based architecture, fine-tuned for molecular generation tasks.
Pretrained Model: A base language model fine-tuned on our custom ChEMBL-MolGen-V1 dataset, which is built from ChEMBL, a database of bioactive molecules used in drug discovery.
Task: Molecule generation with constraints on:
- Molecular weight
- LogP value (lipophilicity)
- Synthetic accessibility
- Functional groups
- Toxicity and stability (including the exclusion of toxic or reactive groups)

Intended Use

This model is intended for chemoinformatics and drug discovery applications, including:

- Designing novel molecules with desired physicochemical properties (e.g., molecular weight, LogP).
- Optimizing molecules for drug-like properties by excluding undesirable functional groups and minimizing toxicity.
- Generating molecules with consideration for environmental impact and biodegradability.
- Predicting the stability of generated molecules under different conditions.

It can be used for tasks such as:

- Virtual screening
- Drug design and optimization
- Property prediction for lead compounds
- Molecule generation with specific functional or structural constraints

Model Features

- Molecular Property Control: Generate molecules with targeted molecular weight, LogP, and other desired properties.
- Functional Group Management: Incorporate specific functional groups and exclude those linked to toxicity (e.g., nitriles, azides).
- Toxicity & Safety Constraints: Generate molecules that avoid known toxic, reactive, or carcinogenic groups.
- Synthetic Accessibility Prediction: Design molecules with favorable synthetic feasibility.

How to Use

Requirements:

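- Python >= 3.7
- PyTorch
- Hugging Face transformers library
- accelerate (needed for device_map="auto" in the example below)
- flash-attn and a compatible GPU (needed for attn_implementation="flash_attention_2"; omit that argument to use the default attention implementation)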

Example Usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "collaiborateorg/Bio-ChEMBL-MolGen-Llama-3-2-1B-V1"

# Reuse the BOS token for padding and pad on the left, as decoder-only generation expects.
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.bos_token
tokenizer.pad_token_id = tokenizer.bos_token_id
tokenizer.padding_side = 'left'

# device_map="auto" requires the accelerate package;
# flash_attention_2 requires the flash-attn package and a supported GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

def stream(user_prompt):
    """Wrap the request in Llama 3 style header tokens and stream the reply to stdout."""
    system_prompt = 'The conversation between Human and AI assistant named Collaiborator\n'
    # The prompt layout, including its whitespace, is kept exactly as published.
    prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>{system_prompt.strip()}<|eot_id|><|start_header_id|>user<|end_header_id|>
    {user_prompt.strip()}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """
    # Move the inputs to the model's device instead of hard-coding "cuda:0".
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    # skip_prompt=True prints only the newly generated tokens.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=1000)

stream("Generate a molecule with a molecular weight around 400.2, LogP of approximately 2.8, TPSA less than 90, and no more than 2 hydrogen bond donors")

Output:
 - **Generated Molecule**:
     - SMILES: O=C(NCc1ccccc1)N1CCC[C@H]1C(=O)Nc1cccc(C(F)(F)F)c1
     - **Property Comparison**:
        - **Molecular Weight**: 400.2
        - **LogP**: 2.8
        - **TPSA**: 90.3
        - **H-bond Acceptors**: 5
        - **H-bond Donors**: 2
        - **Rotatable Bonds**: 4
        - **Ring Count**: 3
        - **QED Score**: 0.674

  **Final Molecule**:
  The generated molecule, while not perfect, should have a good balance of pharmacological properties. It has a molecular weight around 400.2, a LogP within 0.8-1.2, TPSA slightly above 90.3, 3 hydrogen bond acceptors, 2 hydrogen bond donors, 4 rotatable bonds, and a high QED score.
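Because the reply is free text, downstream code usually needs the SMILES string and the reported numbers in structured form. The following is a minimal sketch of one way to extract them, assuming the reply keeps the layout shown in the example output. `parse_generation` is a hypothetical helper, not part of the model or of transformers, and the model does not guarantee this layout.

```python
import re

# Property listing copied from the example output above.
SAMPLE_OUTPUT = """\
 - **Generated Molecule**:
     - SMILES: O=C(NCc1ccccc1)N1CCC[C@H]1C(=O)Nc1cccc(C(F)(F)F)c1
     - **Property Comparison**:
        - **Molecular Weight**: 400.2
        - **LogP**: 2.8
        - **TPSA**: 90.3
        - **H-bond Acceptors**: 5
        - **H-bond Donors**: 2
        - **Rotatable Bonds**: 4
        - **Ring Count**: 3
        - **QED Score**: 0.674
"""

# "SMILES:" followed by the first run of non-whitespace characters.
SMILES_RE = re.compile(r"SMILES:\s*(\S+)")
# "**Name**: number" pairs; labels without a number after the colon are skipped.
PROPERTY_RE = re.compile(r"\*\*([^*]+)\*\*:\s*(-?\d+(?:\.\d+)?)")


def parse_generation(text):
    """Return (smiles, properties) parsed from a reply in the layout above.

    Hypothetical helper: smiles is None when no "SMILES:" line is found,
    and properties maps each reported name to a float.
    """
    match = SMILES_RE.search(text)
    smiles = match.group(1) if match else None
    properties = {
        name.strip(): float(value)
        for name, value in PROPERTY_RE.findall(text)
    }
    return smiles, properties


smiles, props = parse_generation(SAMPLE_OUTPUT)
print(smiles)
print(props)
```

The parsed values are the ones the model reports about its own output. They are not independently computed descriptors, so treat them as claims to verify.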

Input Constraints:

- Molecular weight (range)
- LogP value (range)
- Functional groups (inclusion or exclusion)
- Toxicity constraints (avoidance of specific reactive or harmful groups)
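These constraints are passed to the model as a natural-language request. One way to assemble such a request is a small helper that renders the constraints in the sentence style of the example prompt. This is a sketch under assumptions: `build_prompt` and its parameter names are hypothetical, the functional-group sentences are invented for illustration, and this card does not document how sensitive the model is to phrasing.

```python
def _join_clauses(clauses):
    """Join clauses as 'a', 'a and b', or 'a, b, and c'."""
    if len(clauses) <= 2:
        return " and ".join(clauses)
    return ", ".join(clauses[:-1]) + ", and " + clauses[-1]


def build_prompt(mol_weight=None, logp=None, max_tpsa=None, max_hbd=None,
                 include_groups=(), exclude_groups=()):
    """Compose a plain-English generation request (hypothetical helper).

    Numeric constraints follow the wording of the example prompt.
    The functional-group wording is an assumption, not a documented format.
    """
    clauses = []
    if mol_weight is not None:
        clauses.append(f"a molecular weight around {mol_weight}")
    if logp is not None:
        clauses.append(f"LogP of approximately {logp}")
    if max_tpsa is not None:
        clauses.append(f"TPSA less than {max_tpsa}")
    if max_hbd is not None:
        clauses.append(f"no more than {max_hbd} hydrogen bond donors")

    prompt = "Generate a molecule"
    if clauses:
        prompt += " with " + _join_clauses(clauses)
    if include_groups:
        prompt += ". Include these functional groups: " + ", ".join(include_groups)
    if exclude_groups:
        prompt += ". Exclude these functional groups: " + ", ".join(exclude_groups)
    return prompt


# Reproduces the request used in the example usage.
print(build_prompt(mol_weight=400.2, logp=2.8, max_tpsa=90, max_hbd=2))
# Adds a functional-group exclusion (assumed wording).
print(build_prompt(mol_weight=350, exclude_groups=("azide", "nitrile")))
```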

Output:

- SMILES notation of the generated molecule
- Molecule properties (optional): LogP, molecular weight, and other key descriptors.

Training Data

The model was fine-tuned on our custom ChEMBL-MolGen-V1 dataset, a large collection of bioactive molecules and their bioactivity data drawn from ChEMBL. ChEMBL provides molecular properties, activity data, and detailed chemical structures, which lets the model learn relationships between molecular structure and biological activity.

Evaluation Metrics

The model was evaluated based on the following metrics:

- Molecular Property Prediction Accuracy: How closely the generated molecules adhere to the target properties (e.g., molecular weight, LogP).
- Functional Group Control: How reliably the model includes the requested functional groups and avoids toxic ones.
- Synthetic Accessibility: The predicted ease of synthesis of generated molecules, typically measured with synthetic accessibility scores.
- Toxicity and Stability: How well the model avoids problematic functional groups and produces stable molecules.
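This card does not describe the evaluation procedure in detail. As an illustration of the first metric only, the sketch below scores one generation by checking each requested constraint against the reported properties. The function names, the 5% relative tolerance, and the inclusive upper bounds are all assumptions, not the authors' method.

```python
def check_constraints(reported, targets=None, upper_bounds=None, tolerance=0.05):
    """Return {property: bool} telling whether each constraint is met.

    reported:     property name -> value reported for the molecule.
    targets:      property name -> desired value, met when the relative
                  error is at most `tolerance` (assumed threshold).
    upper_bounds: property name -> maximum allowed value (inclusive).
    """
    results = {}
    for name, target in (targets or {}).items():
        results[name] = abs(reported[name] - target) <= tolerance * abs(target)
    for name, bound in (upper_bounds or {}).items():
        results[name] = reported[name] <= bound
    return results


def adherence_rate(results):
    """Fraction of constraints satisfied; 0.0 when there are none."""
    return sum(results.values()) / len(results) if results else 0.0


# Values taken from the example prompt and its reported properties.
reported = {"Molecular Weight": 400.2, "LogP": 2.8, "TPSA": 90.3, "H-bond Donors": 2}
results = check_constraints(
    reported,
    targets={"Molecular Weight": 400.2, "LogP": 2.8},
    upper_bounds={"TPSA": 90, "H-bond Donors": 2},
)
print(results)
print(adherence_rate(results))
```

For the example generation, three of the four constraints pass and the TPSA bound fails (90.3 against a limit of 90), giving a rate of 0.75. The inputs here are the properties the model reported, so a stricter evaluation would recompute them with a cheminformatics toolkit first.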

Limitations

- Synthetic Feasibility: The model generates novel molecules, but whether they can actually be synthesized varies and must be validated experimentally.
- Property Approximation: The model targets the requested molecular properties but is not guaranteed to meet them exactly.
- Bias in Training Data: The model is fine-tuned on ChEMBL, so its output may be biased toward the chemical space that dataset covers.

License

This model is released under the CC BY-NC-ND 4.0 license, which permits non-commercial use only. For commercial use, please contact the model author for licensing details.

Citation

If you use this model in your research or application, please cite it as follows:

@misc{Collaiborate_Bio-ChEMBL-MolGen-Llama-3-2-1B-V1,
  author = {ContactDoctor},
  title = {Bio-ChEMBL-MolGen: A High-Performance Biomedical Language Model for Molecule Generation Using ChEMBL},
  year = {2024},
  howpublished = {https://huggingface.co/collaiborateorg/Bio-ChEMBL-MolGen-Llama-3-2-1B-V1},
}

Acknowledgements

This model is built on top of Hugging Face's transformers library and was fine-tuned using the ChEMBL dataset. We thank the contributors of ChEMBL for providing the data and Hugging Face for their powerful model framework.