sagawa committed on
Commit
656832f
1 Parent(s): 109fc00

Update README.md

  1. README.md +70 -2
README.md CHANGED
@@ -1,5 +1,73 @@
1
  ---
2
- license: apache-2.0
3
  ---
 
4
  # PubChem-10m-t5
5
- We trained T5 on SMILES from PubChem using the task of masked-language modeling (MLM), and its tokenizer is also trained on PubChem data. This model can be used for the prediction of molecules' properties, reactions, or interactions with proteins by changing the way of finetuning.
1
  ---
2
+ license: mit
3
+ datasets:
4
+ - sagawa/pubchem-10m-canonicalized
5
+ metrics:
6
+ - accuracy
7
+ model-index:
8
+ - name: PubChem-10m-t5
9
+   results:
10
+   - task:
11
+       name: Masked Language Modeling
12
+       type: fill-mask
13
+     dataset:
14
+       name: sagawa/pubchem-10m-canonicalized
15
+       type: sagawa/pubchem-10m-canonicalized
16
+     metrics:
17
+     - name: Accuracy
18
+       type: accuracy
19
+       value: 0.9259435534477234
20
  ---
21
+
22
  # PubChem-10m-t5
23
+
24
+ This model is a fine-tuned version of [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) on the sagawa/pubchem-10m-canonicalized dataset.
25
+ It achieves the following results on the evaluation set:
26
+ - Loss: 0.2121
27
+ - Accuracy: 0.9259
28
+
29
+
30
+ ## Model description
31
+
32
+ We trained T5 on SMILES strings from PubChem with a masked-language modeling (MLM) objective. The tokenizer was also trained on PubChem data.
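
To make the MLM objective concrete, here is a minimal sketch of T5-style masking on a SMILES string. It assumes the usual T5 sentinel-token format (`<extra_id_0>`, `<extra_id_1>`, ...); the character-level split, the chosen spans, and the helper name `span_corrupt` are illustrative assumptions, since this card does not state the masking ratio or span lengths.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns (inputs, targets) in the T5 sentinel format. `spans` must be
    non-overlapping half-open index ranges into `tokens`.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the unmasked prefix
        inputs.append(sentinel)              # mark where the span was removed
        targets.append(sentinel)             # target repeats the sentinel ...
        targets.extend(tokens[start:end])    # ... followed by the masked tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


# Character-level split of acetic acid; the real tokenizer is trained on
# PubChem and will segment SMILES differently (assumption for illustration).
tokens = list("CC(=O)O")
inputs, targets = span_corrupt(tokens, [(1, 3), (5, 6)])
print("".join(inputs))
print("".join(targets))
```

The model sees the corrupted input and is trained to generate the target sequence, so accuracy here is measured on the reconstructed tokens.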
33
+
34
+
35
+ ## Intended uses & limitations
36
+
37
+ This model can be adapted to predict molecular properties, reactions, or molecule-protein interactions by fine-tuning it on the corresponding task.
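
As a starting point for fine-tuning, the checkpoint could be loaded roughly as follows. This is a sketch, not a verified recipe: the Hub ID `sagawa/PubChem-10m-t5` is assumed from the model name and author, the helper names and `max_length=128` are hypothetical, and loading may need adjusting depending on which weight formats are published.

```python
MODEL_ID = "sagawa/PubChem-10m-t5"  # assumed Hub ID, not stated in this card


def load_model(model_id=MODEL_ID):
    """Load the tokenizer and the T5 model (requires `transformers` and network access)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = T5ForConditionalGeneration.from_pretrained(model_id)
    return tokenizer, model


def encode_smiles(tokenizer, smiles, max_length=128):
    """Tokenize one SMILES string into PyTorch tensors; max_length is a hypothetical choice."""
    return tokenizer(smiles, return_tensors="pt", truncation=True, max_length=max_length)


if __name__ == "__main__":
    tokenizer, model = load_model()
    batch = encode_smiles(tokenizer, "CC(=O)O")
    print(batch["input_ids"].shape)
```

For property prediction, the encoded inputs would be paired with task-specific targets (for example, a label rendered as text) before training.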
38
+
39
+ ## Training and evaluation data
40
+
41
+ We downloaded [PubChem data](https://drive.google.com/file/d/1ygYs8dy1-vxD1Vx6Ux7ftrXwZctFjpV3/view), canonicalized the SMILES strings with RDKit, and dropped duplicates. The resulting dataset contains 9,999,960 entries, which were randomly split into training and validation sets at a 10:1 ratio.
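
The preprocessing described above can be sketched as follows. This is an illustration under assumptions, not the script that produced the dataset: the function names, the split seed, and the choice to drop unparsable SMILES are hypothetical.

```python
import random


def rdkit_canonicalize(smiles):
    """Return the RDKit canonical SMILES, or None if the string cannot be parsed."""
    from rdkit import Chem  # imported lazily so the rest runs without RDKit

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def prepare(smiles_list, canonicalize, seed=42, train_parts=10, valid_parts=1):
    """Canonicalize, drop duplicates, then split randomly into train and validation."""
    canonical = (canonicalize(s) for s in smiles_list)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(c for c in canonical if c is not None))
    rng = random.Random(seed)  # the seed used for the real split is not stated
    rng.shuffle(unique)
    n_valid = len(unique) * valid_parts // (train_parts + valid_parts)
    return unique[n_valid:], unique[:n_valid]


# Toy run with str.strip standing in for the RDKit canonicalizer.
train, valid = prepare(["CCO", "CCO ", "CC(=O)O", "c1ccccc1"], str.strip)
print(len(train), len(valid))
```

With RDKit installed, passing `rdkit_canonicalize` instead of `str.strip` makes different spellings of the same molecule collapse to one entry before deduplication.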
42
+
43
+ ## Training procedure
44
+
45
+ ### Training hyperparameters
46
+
47
+ The following hyperparameters were used during training:
48
+ - learning_rate: 5e-03
49
+ - train_batch_size: 30
50
+ - eval_batch_size: 32
51
+ - seed: 42
52
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
53
+ - lr_scheduler_type: linear
54
+ - num_epochs: 30.0
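
For reference, a linear schedule decays the learning rate from its initial value to zero over the planned number of steps. The sketch below shows that rule with the learning rate listed above; the total step count and the absence of warmup are assumptions, since neither is stated in this card.

```python
def linear_lr(step, total_steps, base_lr=5e-3, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr (unused when warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length; the actual total number of steps is not given here.
TOTAL_STEPS = 1000
for step in (0, 250, 500, 1000):
    print(step, linear_lr(step, TOTAL_STEPS))
```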
55
+
56
+ ### Training results
57
+
58
+ | Training Loss | Step   | Accuracy | Validation Loss |
+ |:-------------:|:------:|:--------:|:---------------:|
+ | 0.3866        | 25000  | 0.8830   | 0.3631          |
+ | 0.3352        | 50000  | 0.8996   | 0.3049          |
+ | 0.2834        | 75000  | 0.9057   | 0.2825          |
+ | 0.2685        | 100000 | 0.9099   | 0.2675          |
+ | 0.2591        | 125000 | 0.9124   | 0.2587          |
+ | 0.2620        | 150000 | 0.9144   | 0.2512          |
+ | 0.2806        | 175000 | 0.9161   | 0.2454          |
+ | 0.2468        | 200000 | 0.9179   | 0.2396          |
+ | 0.2669        | 225000 | 0.9194   | 0.2343          |
+ | 0.2611        | 250000 | 0.9210   | 0.2283          |
+ | 0.2346        | 275000 | 0.9226   | 0.2230          |
+ | 0.1972        | 300000 | 0.9238   | 0.2191          |
+ | 0.2344        | 325000 | 0.9250   | 0.2152          |
+ | 0.2164        | 350000 | 0.9259   | 0.2121          |
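
One way to read the validation losses is to convert them to perplexity. The sketch below does this for the first and last logged checkpoints, assuming the reported loss is the mean cross-entropy per predicted token in nats (the usual convention, but not confirmed in this card).

```python
import math

# (step, accuracy, validation loss) copied from the first and last table rows.
checkpoints = [
    (25000, 0.8830, 0.3631),
    (350000, 0.9259, 0.2121),
]


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss)


for step, accuracy, val_loss in checkpoints:
    print(f"step {step}: accuracy={accuracy:.4f}, perplexity={perplexity(val_loss):.3f}")
```

A lower perplexity means the model spreads less probability over wrong tokens at each masked position.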