entropy committed on
Commit
b2975b7
·
1 Parent(s): c889db3

Update README.md

Files changed (1)
  1. README.md +26 -1
README.md CHANGED
@@ -8,7 +8,8 @@ tags:
8
  # Roberta Zinc Decoder
9
 
10
  This model is a GPT2 decoder model designed to reconstruct SMILES strings from embeddings created by the
11
- [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model.
 
12
 
13
  The decoder model conditions generation on mean pooled embeddings from the encoder model. Mean pooled
14
  embeddings are used to allow for integration with vector databases, which require fixed length embeddings.
@@ -62,6 +63,30 @@ gen = decoder_model.generate(
62
  reconstructed_smiles = tokenizer.batch_decode(gen, skip_special_tokens=True)
63
  ```
64
 
65
  ---
66
  license: mit
67
  ---
 
8
  # Roberta Zinc Decoder
9
 
10
  This model is a GPT2 decoder model designed to reconstruct SMILES strings from embeddings created by the
11
+ [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model. The decoder model was
12
+ trained on 30m compounds from the [ZINC Database](https://zinc.docking.org/).
13
 
14
  The decoder model conditions generation on mean pooled embeddings from the encoder model. Mean pooled
15
  embeddings are used to allow for integration with vector databases, which require fixed length embeddings.
 
63
  reconstructed_smiles = tokenizer.batch_decode(gen, skip_special_tokens=True)
64
  ```
65
 
66
+ ## Model Performance
67
+
68
+ The decoder model was evaluated on a test set of 1m compounds from ZINC. Compounds
69
+ were encoded with the [roberta_zinc_480m](https://huggingface.co/entropy/roberta_zinc_480m) model
70
+ and reconstructed with the decoder model.
71
+
72
+ The following metrics are computed:
73
+ * `exact_match` - percent of inputs exactly reconstructed
74
+ * `token_accuracy` - percent of output tokens exactly matching input tokens (excluding padding)
75
+ * `valid_structure` - percent of generated outputs that resolved to a valid SMILES string
76
+ * `tanimoto` - Tanimoto similarity between inputs and generated outputs, excluding invalid structures
77
+ * `cos_sim` - cosine similarity between input encoder embeddings and output encoder embeddings
78
+
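As a rough illustration of the metric definitions listed above, the following is a minimal sketch and not the evaluation code behind this commit. It assumes token IDs are plain Python lists, that padding uses a hypothetical `PAD_ID` of 0, and that fingerprints are given as sets of on-bit indices (in practice these would come from a cheminformatics toolkit). `valid_structure` is omitted because it needs a SMILES parser.

```python
import math

PAD_ID = 0  # hypothetical padding token id; the real tokenizer's value may differ


def exact_match(inputs, outputs):
    """Fraction of input SMILES strings that were reproduced exactly."""
    return sum(a == b for a, b in zip(inputs, outputs)) / len(inputs)


def token_accuracy(input_ids, output_ids, pad_id=PAD_ID):
    """Fraction of non-padding input positions whose output token matches."""
    correct = 0
    total = 0
    for src, out in zip(input_ids, output_ids):
        for i, tok in enumerate(src):
            if tok == pad_id:
                continue  # padding positions are excluded from the count
            total += 1
            if i < len(out) and out[i] == tok:
                correct += 1
    return correct / total


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cos_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

The per-position comparison in `token_accuracy` is one plausible reading of "output tokens exactly matching input tokens"; the actual evaluation may align or average tokens differently.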
79
+ `eval_type=full` reports metrics for the full 1m compound test set.
80
+
81
+ `eval_type=failed` reports metrics only for the generated outputs that failed to exactly reproduce the inputs.
82
+
83
+
84
+ |eval_type|exact_match|token_accuracy|valid_structure|tanimoto|cos_sim |
85
+ |---------|-----------|--------------|---------------|--------|--------|
86
+ |full |0.948277 |0.990704 |0.994278 |0.987698|0.998224|
87
+ |failed |0.000000 |0.820293 |0.889372 |0.734097|0.965668|
88
+
89
+
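The two rows of the table can be cross-checked against each other. The sketch below assumes each metric is a per-compound average and that an exactly reconstructed compound scores 1.0 on `token_accuracy`, `valid_structure`, and `cos_sim`; under those assumptions the `full` value should equal the exact-match rate plus the failed fraction times the `failed` value. The helper name `blend` is made up for this illustration.

```python
# Values copied from the table in this commit.
full = {
    "exact_match": 0.948277,
    "token_accuracy": 0.990704,
    "valid_structure": 0.994278,
    "cos_sim": 0.998224,
}
failed = {
    "token_accuracy": 0.820293,
    "valid_structure": 0.889372,
    "cos_sim": 0.965668,
}


def blend(failed_value, p_exact=full["exact_match"]):
    """Weighted average of exact matches (assumed to score 1.0) and failures."""
    return p_exact * 1.0 + (1.0 - p_exact) * failed_value


for name, failed_value in failed.items():
    # Compare the reconstructed full-set value with the reported one.
    print(f"{name}: reconstructed={blend(failed_value):.6f} reported={full[name]:.6f}")
```

The reconstructed values agree with the reported `full` row to within a few millionths. `tanimoto` is left out of the check because it excludes invalid structures, so its averages are taken over different subsets.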
90
  ---
91
  license: mit
92
  ---