Commit b5c8dfa by Yin Fang
Parent(s): a168373
Create README.md
README.md
ADDED
@@ -0,0 +1,55 @@
---
tags:
- molecular language model
- SELFIES
- molecule optimization
---

# MolGen-large-opt
MolGen-large-opt was introduced in the paper ["Molecular Language Model as Multi-task Generator"](https://arxiv.org/pdf/2301.11259.pdf) and first released in [this repository](https://github.com/zjunlp/MolGen). It is a pre-trained molecular generative model built on SELFIES, a 100% robust molecular language representation in which every string decodes to a chemically valid molecule.

## Model description
MolGen-large-opt is the first pre-trained model that produces only chemically valid molecules.
With a training corpus of over 100 million molecules in SELFIES representation, MolGen-large learns the intrinsic structural patterns of molecules by mapping corrupted SELFIES back to their original forms.
Specifically, MolGen-large employs a bidirectional Transformer as its encoder and an autoregressive Transformer as its decoder.
Through its carefully designed multi-task molecular prefix tuning (MPT), MolGen-large-opt can generate molecules with desired properties, making it a valuable tool for molecular optimization.



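To make the idea of "mapping corrupted SELFIES to their original forms" more concrete, here is a minimal sketch of how a denoising training pair could be built. It is not the corruption scheme used in the paper: the independent per-symbol masking, the `<mask>` placeholder, and the helper names `split_selfies` and `corrupt` are all assumptions made for illustration, and the real tokenizer defines its own special tokens.

```python
import random
import re
from typing import List

# A SELFIES string is a sequence of bracketed symbols such as "[C]" or "[=Branch1]".
SELFIES_SYMBOL = re.compile(r"\[[^\]]*\]")

# Hypothetical placeholder; the actual tokenizer defines its own mask token.
MASK = "<mask>"


def split_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return SELFIES_SYMBOL.findall(selfies)


def corrupt(symbols: List[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Replace each symbol with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else s for s in symbols]


original = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
symbols = split_selfies(original)
noisy = corrupt(symbols, mask_prob=0.3, rng=random.Random(0))

# A denoising model is trained to reconstruct the target from the corrupted input.
training_pair = {"input": "".join(noisy), "target": original}
print(training_pair)
```

In an encoder-decoder setup like the one described above, the corrupted string would be fed to the encoder and the original string used as the decoder target.
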
## Intended uses
You can use the raw model for molecule generation or fine-tune it on a downstream task. Note that the following example only demonstrates molecule generation with our pre-trained model. See the [repository](https://github.com/zjunlp/MolGen) for fine-tuning details on a task that interests you.

### How to use
Molecule generation example:
```python
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("zjunlp/MolGen-large-opt")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("zjunlp/MolGen-large-opt")

>>> # tokenize the input SELFIES string (benzene)
>>> sf_input = tokenizer("[C][=C][C][=C][C][=C][Ring1][=Branch1]", return_tensors="pt")
>>> # beam search: keep 5 beams and return all 5 sequences
>>> molecules = model.generate(input_ids=sf_input["input_ids"],
...                            attention_mask=sf_input["attention_mask"],
...                            max_length=15,
...                            min_length=5,
...                            num_return_sequences=5,
...                            num_beams=5)
>>> # decode token ids to SELFIES strings, removing the spaces between tokens
>>> sf_output = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True).replace(" ","") for g in molecules]
>>> sf_output
['[C][=C][C][=C][C][=C][Ring1][=Branch1]',
 '[C][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]',
 '[C][=C][C][=C][C][=C][Ring1][=Branch1][C][=C][C][=C]',
 '[C][=C][C][=C][C][=C][Ring1][=Branch1][C@H1][C][=C][C]',
 '[C][=C][C][=C][C][=C][Ring1][=Branch1][C@H1][=C][C][=C]']
```
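
In the output above, the first beam is the input molecule itself. If you only want new candidates, a small post-processing step can drop copies of the input and any duplicates. The following is a minimal sketch in plain Python; the helper name `novel_candidates` is hypothetical and not part of the MolGen repository or `transformers`.

```python
from typing import List


def novel_candidates(source: str, generated: List[str]) -> List[str]:
    """Drop copies of the source and duplicate strings, preserving beam order."""
    seen = {source}
    kept = []
    for selfies in generated:
        if selfies not in seen:
            seen.add(selfies)
            kept.append(selfies)
    return kept


source = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
# Strings copied from the beam search output shown in the example above.
sf_output = [
    "[C][=C][C][=C][C][=C][Ring1][=Branch1]",
    "[C][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]",
    "[C][=C][C][=C][C][=C][Ring1][=Branch1][C][=C][C][=C]",
    "[C][=C][C][=C][C][=C][Ring1][=Branch1][C@H1][C][=C][C]",
    "[C][=C][C][=C][C][=C][Ring1][=Branch1][C@H1][=C][C][=C]",
]

candidates = novel_candidates(source, sf_output)
for selfies in candidates:
    print(selfies)
```

This compares strings exactly, so two different SELFIES strings that encode the same molecule would both be kept; canonicalizing the molecules first would need a cheminformatics toolkit.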


### BibTeX entry and citation info
```bibtex
@article{fang2023molecular,
  title={Molecular Language Model as Multi-task Generator},
  author={Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
  journal={arXiv preprint arXiv:2301.11259},
  year={2023}
}
```