Update README.md
README.md CHANGED
@@ -13,6 +13,56 @@ license: cc-by-nc-sa-4.0
- Hands-on learning, research and experimentation in molecular generation
- Baseline for ablation studies and comparisons with more advanced models

## Use

```python
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM
import torch

# Load the tokenizer from its JSON file and declare the special tokens
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="gpt2_tokenizer.json",
    model_max_length=512,
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="</s>",
    bos_token="<s>",
    mask_token="<mask>",
)

model = AutoModelForCausalLM.from_pretrained("gbyuvd/chemfie-gpt-experiment-1")

# Generate some sample outputs
def generate_molecules(model, tokenizer, num_samples=5, max_length=100):
    model.eval()
    generated = []
    for _ in range(num_samples):
        # Sample one sequence starting from the BOS token
        input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(model.device)
        output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, do_sample=True)
        generated.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return generated

sample_molecules = generate_molecules(model, tokenizer)
print("Sample generated molecules:")
for i, mol in enumerate(sample_molecules, 1):
    print(f"{i}. {mol}")
```
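
Since `do_sample=True` draws each token stochastically, every call to `generate_molecules` returns different outputs. A minimal sketch, assuming reproducible samples are wanted (the seed value is arbitrary and not part of the original example):

```python
from transformers import set_seed

set_seed(42)  # fixes the Python, NumPy, and PyTorch RNGs so sampling is repeatable
sample_molecules = generate_molecules(model, tokenizer)
```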

Converting tokenized SELFIES output to SMILES:

```python
import selfies as sf

# A tokenized SELFIES string (tokens separated by spaces)
test = "[C] [Branch1] [O] [=C] [C] [C] [C] [C] [C] [C] [C] [=Branch1] [=O] [O] [=C] [C] [C] [C] [Ring1]"
test = test.replace(' ', '')  # selfies expects the tokens without separating spaces
print(sf.decoder(test))

"""
C(CCCCCCCCO)=CCC=C
"""
```
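
The same decoding step can be applied to what the model samples. A small sketch, assuming the generated strings are space-separated SELFIES tokens like the example above and that RDKit is available for an optional validity check (the helper name `selfies_to_smiles` is illustrative, not from the original):

```python
import selfies as sf
from rdkit import Chem

def selfies_to_smiles(selfies_tokens):
    """Decode a space-separated SELFIES token string to SMILES, or None on failure."""
    try:
        smiles = sf.decoder(selfies_tokens.replace(" ", ""))
    except sf.DecoderError:
        return None
    # RDKit returns None for SMILES it cannot parse
    return smiles if Chem.MolFromSmiles(smiles) is not None else None

# `sample_molecules` comes from the generation example above
for s in sample_molecules:
    print(selfies_to_smiles(s) or "could not decode")
```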

## Training Data
- **Source**: Curated and merged from the COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
- **Total**: 2,346,680 samples

@@ -28,12 +78,12 @@ license: cc-by-nc-sa-4.0
## Training Logs

| Chunk | Training Loss | Validation Loss | Status |
| :---: | :-----------: | :-------------: | :-------: |
| I | 1.346400 | 1.065180 | Done |
| II | | | Ongoing |
| III | | | Scheduled |
| IV | | | Scheduled |

## Evaluation Results