Update README.md
README.md
CHANGED
@@ -89,7 +89,7 @@ model-index:
|
|
89 |
name: Spearman Max
|
90 |
---
|
91 |
|
92 |
-
# ChEmbed v0.1
|
93 |
|
94 |
This prototype is a [sentence-transformers](https://www.SBERT.net) model based on [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) and fine-tuned on around 1 million pairs of valid natural compounds' SELFIES [(Krenn et al. 2020)](https://github.com/aspuru-guzik-group/selfies) taken from COCONUTDB [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/). It maps compounds' *Self-Referencing Embedded Strings* (SELFIES) into a 768-dimensional dense vector space and can potentially be used for chemical similarity, similarity search, classification, clustering, and more.
|
95 |
|
@@ -184,6 +184,13 @@ print(similarities.shape)
|
|
184 |
## Limitations
|
185 |
For now, the model might be ineffective at embedding synthetic drugs, since it has so far been trained only on natural products. Also, the tokenizer has not yet been customized.
|
186 |
|
187 |
### Framework Versions
|
188 |
- Python: 3.9.13
|
189 |
- Sentence Transformers: 3.0.1
|
|
|
89 |
name: Spearman Max
|
90 |
---
|
91 |
|
92 |
+
# ChEmbed v0.1 - Chemical Embeddings
|
93 |
|
94 |
This prototype is a [sentence-transformers](https://www.SBERT.net) model based on [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) and fine-tuned on around 1 million pairs of valid natural compounds' SELFIES [(Krenn et al. 2020)](https://github.com/aspuru-guzik-group/selfies) taken from COCONUTDB [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/). It maps compounds' *Self-Referencing Embedded Strings* (SELFIES) into a 768-dimensional dense vector space and can potentially be used for chemical similarity, similarity search, classification, clustering, and more.
|
95 |
|
|
|
184 |
## Limitations
|
185 |
For now, the model might be ineffective at embedding synthetic drugs, since it has so far been trained only on natural products. Also, the tokenizer has not yet been customized.
|
186 |
|
187 |
+
## Testing Generated Embeddings' Clusters
|
188 |
+
The plots below show how the model's embeddings (at this stage) cluster different classes of compounds, compared with MACCS fingerprints.
|
189 |
+
|
190 |
+
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/c8_5IWjPgbrGY0Z9-ZHop.png)
|
191 |
+
|
192 |
+
![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/EHEcaSnra4lldI0LY5tGq.png)
|
193 |
+
|
194 |
### Framework Versions
|
195 |
- Python: 3.9.13
|
196 |
- Sentence Transformers: 3.0.1
|