πŸ’‘ Model description

This repo contains MolGen-7b, a large molecular generative model trained on the molecular language SELFIES (Self-Referencing Embedded Strings).

πŸ” Intended uses

You can use the model to generate molecules from scratch (i.e., by inputting only the bos_token), or to complete a partial structure that you provide as a prompt.

πŸ› οΈ How to use

We provide two examples below: de novo generation and molecular completion. You can adjust the input, generation parameters, and so on to suit your needs; a note on decoding the generated SELFIES strings to SMILES follows the examples.
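Both examples load the model with 8-bit quantization to reduce GPU memory use. On recent versions of πŸ€— Transformers, passing load_in_8bit directly to from_pretrained is deprecated in favor of an explicit quantization config; a minimal sketch of the equivalent call, assuming bitsandbytes is installed:

>>> from transformers import BitsAndBytesConfig, LlamaForCausalLM
>>> model = LlamaForCausalLM.from_pretrained(
                              "zjunlp/MolGen-7b",
                              quantization_config=BitsAndBytesConfig(load_in_8bit=True),
                              device_map="auto",
                              )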

  • De novo molecule generation example:
>>> from transformers import AutoTokenizer, LlamaForCausalLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-7b")
>>> # Load the model in 8-bit to reduce GPU memory usage
>>> model = LlamaForCausalLM.from_pretrained(
                              "zjunlp/MolGen-7b",
                              load_in_8bit=True,
                              torch_dtype=torch.float16,
                              device_map="auto",
                              )
>>> # Prompt with the BOS token alone so the model generates a molecule from scratch;
>>> # model.device places the input on the same device that device_map chose for the model
>>> sf_input = tokenizer(tokenizer.bos_token, return_tensors="pt").to(model.device)

>>> # Sample 5 candidate molecules with nucleus (top-p) and top-k sampling
>>> molecules = model.generate(input_ids=sf_input["input_ids"],
                              attention_mask=sf_input["attention_mask"],
                              do_sample=True,
                              max_new_tokens=10,
                              top_p=0.75,
                              top_k=30,
                              return_dict_in_generate=False,
                              num_return_sequences=5,
                              )
>>> # Decode the token ids back to SELFIES strings, removing the spaces between tokens
>>> sf_output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True).replace(" ","") for g in molecules]
['[C][C][=C][C][=C][Branch2][Ring1][=Branch2][C][=Branch1]',
'[C][N][C][C][C][Branch2][Ring2][Ring2][N][C]',
'[C][O][C][=C][C][=C][C][Branch2][Ring1][Branch1]',
'[C][N][C][C][C@H1][Branch2][Ring1][Branch2][N][Branch1]',
'[C][=C][C][Branch2][Ring1][#C][C][=Branch1][C][=O]']
  • Molecular completion example:
>>> from transformers import AutoTokenizer, LlamaForCausalLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-7b")
>>> # Load the model in 8-bit to reduce GPU memory usage
>>> model = LlamaForCausalLM.from_pretrained(
                              "zjunlp/MolGen-7b",
                              load_in_8bit=True,
                              torch_dtype=torch.float16,
                              device_map="auto",
                              )
>>> # Prompt with a partial SELFIES structure for the model to complete
>>> sf_input = tokenizer("[C][N][O]", return_tensors="pt").to(model.device)

>>> # Sample 5 candidate completions with nucleus (top-p) and top-k sampling
>>> molecules = model.generate(input_ids=sf_input["input_ids"],
                              attention_mask=sf_input["attention_mask"],
                              do_sample=True,
                              max_new_tokens=10,
                              top_p=0.75,
                              top_k=30,
                              return_dict_in_generate=False,
                              num_return_sequences=5,
                              )
>>> # Decode the token ids back to SELFIES strings, removing the spaces between tokens
>>> sf_output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True).replace(" ","") for g in molecules]
['[C][N][O][C][=Branch1][C][=O][/C][Ring1][=Branch1][=C][/C][=C]',
'[C][N][O][/C][=Branch1][#Branch1][=C][/N][Branch1][C][C][C][C]',
'[C][N][O][/C][=C][/C][=C][C][=Branch1][C][=O][C][=C]',
'[C][N][O][C][=Branch1][C][=O][N][Branch1][C][C][C][=Branch1]',
'[C][N][O][Ring1][Branch1][C][C][C][C][C][C][C][C]']
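
The generated sequences are SELFIES strings, which by design always correspond to valid molecules. To convert them to SMILES (e.g., for downstream use with RDKit), you can decode them with the selfies package; a minimal sketch, assuming the separate selfies dependency is installed (pip install selfies):

>>> import selfies as sf
>>> # Decode each generated SELFIES string into a SMILES string
>>> smiles = [sf.decoder(s) for s in sf_output]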

πŸ“š Citation

If you use our repository, please cite:

@inproceedings{fang2023domain,
  author       = {Yin Fang and
                  Ningyu Zhang and
                  Zhuo Chen and
                  Xiaohui Fan and
                  Huajun Chen},
  title        = {Domain-Agnostic Molecular Generation with Chemical Feedback},
  booktitle    = {{ICLR}},
  publisher    = {OpenReview.net},
  year         = {2024},
  url          = {https://openreview.net/pdf?id=9rPyHyjfwP}
}