---
license: mit
---

# GPT2 Zinc 87m

This is a GPT2-style autoregressive language model, available through [Huggingface](https://huggingface.co/entropy/gpt2_zinc_87m), trained on ~480m SMILES strings from the [ZINC database](https://zinc.docking.org/).

The model has ~87m parameters and was trained for 175,000 iterations with a batch size of 3072, reaching a validation loss of ~0.615. It is useful for generating drug-like molecules or for generating embeddings from SMILES strings.

## How to use

To use the model, install the [transformers](https://github.com/huggingface/transformers) library:

```
pip install transformers
```

Load the model from the Huggingface Hub:

```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained('entropy/gpt2_zinc_87m', max_len=256)
model = GPT2LMHeadModel.from_pretrained('entropy/gpt2_zinc_87m')
```
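
As a quick sanity check, you can tokenize an example SMILES string (aspirin here, chosen arbitrarily) and decode it back; the decoded string should match the input up to whitespace:

```python
# arbitrary example SMILES (aspirin); any valid SMILES string works here
sample = 'CC(=O)Oc1ccccc1C(=O)O'
token_ids = tokenizer(sample)['input_ids']
print(tokenizer.decode(token_ids, skip_special_tokens=True))
```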

To generate molecules:

```python
import torch

# start generation from the beginning-of-sequence token
inputs = torch.tensor([[tokenizer.bos_token_id]])

gen = model.generate(
    inputs,
    do_sample=True,
    max_length=256,
    temperature=1.,
    early_stopping=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=32
)
smiles = tokenizer.batch_decode(gen, skip_special_tokens=True)
```
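
Sampled strings are not guaranteed to be chemically valid (see the performance table below). A common post-processing step is to keep only parseable SMILES; this is a minimal sketch assuming [RDKit](https://www.rdkit.org/) is installed, which the model itself does not require:

```python
from rdkit import Chem

# keep only generated strings that RDKit can parse into a molecule
valid_smiles = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
```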

To compute embeddings, mean-pool the final hidden states over non-padding tokens:

```python
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors='pt')

inputs = collator(tokenizer(smiles))
outputs = model(**inputs, output_hidden_states=True)

# last hidden state, masked and averaged over non-padding positions
full_embeddings = outputs[-1][-1]
mask = inputs['attention_mask']
embeddings = ((full_embeddings * mask.unsqueeze(-1)).sum(1) / mask.sum(-1).unsqueeze(-1))
```
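
These pooled embeddings can be used, for example, to rank molecules by similarity. This is an illustrative sketch using cosine similarity, not a prescribed workflow:

```python
import torch.nn.functional as F

# rank all embedded molecules by cosine similarity to the first one
query = embeddings[0:1]
sims = F.cosine_similarity(query, embeddings, dim=-1)
print(sims.topk(k=min(5, len(smiles))))
```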

## Model Performance

To test generation performance, 1m compounds were generated at various temperature values. Generated compounds were checked for uniqueness and structural validity.

* `percent_unique` denotes `n_unique_smiles/n_total_smiles`
* `percent_valid` denotes `n_valid_smiles/n_unique_smiles`
* `percent_unique_and_valid` denotes `n_valid_smiles/n_total_smiles`

| temperature | percent_unique | percent_valid | percent_unique_and_valid |
|------------:|---------------:|--------------:|-------------------------:|
|         0.5 |       0.928074 |             1 |                 0.928074 |
|        0.75 |       0.998468 |      0.999967 |                 0.998436 |
|           1 |       0.999659 |      0.999164 |                 0.998823 |
|        1.25 |       0.999514 |       0.99351 |                 0.993027 |
|         1.5 |       0.998749 |      0.970223 |                  0.96901 |
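
For reference, the three ratios above could be computed along these lines. This is a rough sketch that assumes [RDKit](https://www.rdkit.org/) for validity checking; it is not the exact evaluation script used to produce the table:

```python
from rdkit import Chem

def generation_metrics(all_smiles):
    # percent_unique = n_unique_smiles / n_total_smiles
    unique = set(all_smiles)
    # percent_valid = n_valid_smiles / n_unique_smiles (valid = RDKit can parse the string)
    valid = [s for s in unique if Chem.MolFromSmiles(s) is not None]
    return {
        'percent_unique': len(unique) / len(all_smiles),
        'percent_valid': len(valid) / len(unique),
        'percent_unique_and_valid': len(valid) / len(all_smiles),
    }
```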

Property histograms computed over 1m generated compounds:

![image/png]()