Update README.md
README.md CHANGED
@@ -13,6 +13,56 @@ license: cc-by-nc-sa-4.0
- Hands-on learning, research and experimentation in molecular generation
- Baseline for ablation studies and comparisons with more advanced models

## Use

```python
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM
import torch

# Load the tokenizer from its JSON file and declare the special tokens
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="gpt2_tokenizer.json",
    model_max_length=512,
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="</s>",
    bos_token="<s>",
    mask_token="<mask>",
)

model = AutoModelForCausalLM.from_pretrained("gbyuvd/chemfie-gpt-experiment-1")

# Generate some sample outputs
def generate_molecules(model, tokenizer, num_samples=5, max_length=100):
    model.eval()
    generated = []
    for _ in range(num_samples):
        # Sample one sequence starting from the BOS token
        input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(model.device)
        output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, do_sample=True)
        generated.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return generated

sample_molecules = generate_molecules(model, tokenizer)
print("Sample generated molecules:")
for i, mol in enumerate(sample_molecules, 1):
    print(f"{i}. {mol}")
```
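
Since `do_sample=True` draws each token stochastically, every call to `generate_molecules` returns different outputs. A minimal sketch, assuming reproducible samples are wanted (the seed value is arbitrary and not part of the original example):

```python
from transformers import set_seed

set_seed(42)  # fixes the Python, NumPy, and PyTorch RNGs so sampling is repeatable
sample_molecules = generate_molecules(model, tokenizer)
```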

Converting tokenized SELFIES output to SMILES:

```python
import selfies as sf

# A tokenized SELFIES string (tokens separated by spaces)
test = "[C] [Branch1] [O] [=C] [C] [C] [C] [C] [C] [C] [C] [=Branch1] [=O] [O] [=C] [C] [C] [C] [Ring1]"
test = test.replace(' ', '')  # selfies expects the tokens without separating spaces
print(sf.decoder(test))

"""
C(CCCCCCCCO)=CCC=C
"""
```
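
The same decoding step can be applied to what the model samples. A small sketch, assuming the generated strings are space-separated SELFIES tokens like the example above and that RDKit is available for an optional validity check (the helper name `selfies_to_smiles` is illustrative, not from the original):

```python
import selfies as sf
from rdkit import Chem

def selfies_to_smiles(selfies_tokens):
    """Decode a space-separated SELFIES token string to SMILES, or None on failure."""
    try:
        smiles = sf.decoder(selfies_tokens.replace(" ", ""))
    except sf.DecoderError:
        return None
    # RDKit returns None for SMILES it cannot parse
    return smiles if Chem.MolFromSmiles(smiles) is not None else None

# `sample_molecules` comes from the generation example above
for s in sample_molecules:
    print(selfies_to_smiles(s) or "could not decode")
```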

## Training Data
- **Source**: Curated and merged from the COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
- **Total**: 2,346,680 samples

@@ -28,12 +78,12 @@ license: cc-by-nc-sa-4.0
## Training Logs

| Chunk | Training Loss | Validation Loss | Status |
| :---: | :-----------: | :-------------: | :-------: |
| I | 1.346400 | 1.065180 | Done |
| II | | | Ongoing |
| III | | | Scheduled |
| IV | | | Scheduled |

## Evaluation Results