---
datasets: [COCONUTDB]
language: [code]
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:1,183,174
- loss:CosineSimilarityLoss
widget:
- source_sentence: '[O][=C][Branch2][Branch2][Ring1][O][C][C][Branch2][Ring1][=Branch1][O][C][=Branch1][C][=O][C][C][C][C][C][C][=C][C][C][C][C][C][C][C][C][C][C][O][P][=Branch1][C][=O][Branch1][C][O][O][C][C][Branch2][Ring1][Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][Branch1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][Ring1][Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][C][C][C][C][C][C][C][C][C][C][C][C][C]'
  sentences:
  - '[O][=C][Branch2][Ring1][N][N][N][=C][Branch1][N][C][=C][C][=C][Branch1][C][Cl][C][=C][Ring1][#Branch1][C][=C][C][=C][Branch1][C][Cl][C][=C][Ring1][#Branch1][C][=C][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=C][Ring1][#Branch2][O]'
  - '[O][=C][Branch1][C][O][C][=C][Branch1][C][C][C][C][=Branch1][C][=O][O][C]'
  - '[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][#C][C][Branch1][C][O][C][C][C][C]'
- source_sentence: '[O][=C][Branch1][#Branch1][O][C][Branch1][C][C][C][C][C][C][C][C][C][C][C]'
  sentences:
  - '[O][=C][O][C][C][Branch1][C][O][C][C][Ring1][#Branch1][=C][C][C][C][C][C][C][C][C][C][C][C][C][C]'
  - '[O][=C][C][=C][C][O][C][O][C][=Ring1][Branch1][C][=C][Ring1][=Branch2][Br]'
  - '[O][=C][Branch2][#Branch1][=C][O][C][C][=Branch1][C][=C][C][C][Branch1][#Branch1][O][C][=Branch1][C][=O][C][C][C][C][=Branch1][C][=O][C][Branch2][=Branch1][Ring1][O][C][C][Ring1][Branch2][Branch1][C][C][C][Ring1][=Branch1][Branch1][C][O][C][Branch1][#Branch1][O][C][=Branch1][C][=O][C][C][Branch1][#Branch1][O][C][=Branch1][C][=O][C][C][Ring2][Ring1][N][Branch1][#C][C][O][C][=Branch1][C][=O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][Branch1][Branch2][C][O][C][=Branch1][C][=O][C][C][Ring2][Ring2][S][C][C][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]'
- source_sentence: '[O][=C][O][C][=C][Branch2][Ring1][#C][C][=C][C][O][C][N][Branch1][S][C][=C][C][=C][Branch1][=Branch1][O][C][C][C][C][C][=C][Ring1][O][C][C][Ring2][Ring1][Branch1][=Ring1][P][C][=Branch1][=C][=C][Ring2][Ring1][=Branch2][C][C][=C][C][=C][C][=C][Ring1][=Branch1][C]'
  sentences:
  - '[O][=C][N][C][Branch1][S][C][=Branch1][C][=O][N][C][=C][C][=C][C][=C][Ring1][N][Ring1][=Branch1][C][C][=Branch1][C][=O][N][C][C][C][=N][C][=Branch1][Branch1][=C][S][Ring1][Branch1][C]'
  - '[O][=C][C][=C][Branch2][Branch2][O][O][C][=C][C][Branch2][Ring2][#C][O][C][C][Branch1][Ring1][C][O][C][C][C][=C][C][NH1][C][=C][C][=Ring1][Branch1][C][=C][Ring1][=Branch2][C][C][=Branch1][C][=O][C][Branch1][Ring1][C][O][C][C][=Branch1][Branch2][=C][Ring2][Ring1][#C][Ring2][Ring1][O][C][Ring1][#Branch2][=C][C][Branch2][Ring2][O][O][C][Branch1][Ring2][C][Ring1][Branch1][C][Branch1][C][O][C][Branch2][Ring1][Branch2][C][=C][C][C][Branch1][S][N][Branch1][=Branch1][C][C][C][O][C][C][C][C][Ring1][S][Ring1][O][C][C][C][C][C][O][=C][Ring2][Branch1][=Branch2][C][O][C][=Branch1][C][=O][O][C][C]'
  - '[O][C][C][O][C][Branch2][Branch2][#Branch2][O][C][C][Branch1][C][O][C][Branch1][C][O][C][Branch2][#Branch1][#Branch2][O][C][Ring1][Branch2][O][C][C][C][C][Branch2][=Branch1][=N][C][=Branch2][Branch1][P][=C][C][Branch1][C][O][C][Branch1][C][C][C][Ring1][Branch2][C][C][Branch1][C][O][C][C][Branch1][=Branch2][C][C][C][Ring1][O][Ring1][Branch1][C][C][Branch2][Ring1][Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][Branch1][C][C][C][C][=C][C][Branch1][C][C][C][C][Ring2][Ring2][=Branch2][Branch1][C][C][C][C][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][Branch1][S][O]'
- source_sentence: '[O][=C][O][C][=C][C][=C][Branch1][O][O][C][=Branch1][C][=O][N][Branch1][C][C][C][C][=C][Ring1][N][C][=Branch1][Ring2][=C][Ring1][S][C][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Ring1][=Branch2]'
  sentences:
  - '[O][=C][C][=Branch2][#Branch1][=N][=C][C][Branch2][Ring2][N][C][NH1+1][C][=Branch2][Ring2][Ring1][=C][Branch1][#Branch2][C][=N][C][=C][C][Ring1][Branch2][Ring1][Branch1][C][C][=C][C][=Branch1][S][=C][C][=Branch1][Ring2][=C][Ring1][=Branch1][C][C][C][O][C][C][Ring1][=Branch1][C][C][C][O][C][Branch1][C][O][C][C][Branch1][C][C][C][C][C][=Branch1][C][=O][C][Branch1][C][C][Branch1][C][C][C][Ring1][#Branch2][C][C][C][Ring1][=C][Branch1][C][C][C][Ring2][Ring2][=C][Branch1][C][C][C][Ring2][Branch1][C][C][Branch1][C][C][C][C][Branch1][C][O][C][O][C][Ring1][Ring1][Branch1][C][C][C]'
  - '[O][=C][Branch1][C][O][C][Branch2][O][P][O][C][C][=Branch1][C][=O][N][C][C][Branch1][C][O][C][C][Branch2][=Branch2][=Branch2][O][C][C][=Branch1][C][=O][N][C][C][Branch1][C][O][C][C][Branch2][=Branch1][P][O][C][C][Branch1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Branch2][Branch1][#Branch2][O][P][=Branch1][C][=O][Branch1][C][O][O][C][C][Branch2][Ring1][O][N][C][=Branch1][C][=O][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][C][C][C][C][C][C][C][C][C][C][C][Ring2][Branch1][#Branch1][O][Branch1][P][O][C][Ring2][Branch1][S][C][Branch1][C][O][C][Branch1][C][O][C][O][C][C][=Branch1][C][=O][O][Branch1][P][O][C][Ring2][#Branch1][=Branch1][C][Branch1][C][O][C][Branch1][C][O][C][O][C][C][=Branch1][C][=O][O][O][C][Branch1][N][C][Branch1][C][O][C][Branch1][C][O][C][O][C][C][Branch1][Branch2][N][C][=Branch1][C][=O][C][O][C][Branch1][C][O][C][Ring2][=Branch2][Branch2]'
  - '[O][=C][Branch1][C][N][C][=N][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][N][Ring1][O][C][C][O][C]'
- source_sentence: '[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]'
  sentences:
  - '[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][#C][C][=C][C][C][C]'
  - '[O][C][C][O][C][Branch2][=Branch2][Ring1][O][C][Branch1][C][C][C][C][C][Branch1][C][O][O][C][C][C][C][C][C][C][C][C][Branch2][Branch1][N][O][C][O][C][Branch1][Ring1][C][O][C][Branch2][Ring2][#Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][O][C][Branch1][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][=Branch2][O][C][Branch1][C][O][C][Ring2][Ring1][#C][O][C][C][C][Ring2][Ring2][#Branch1][Branch1][C][C][C][Ring2][Ring2][N][C][C][C][Ring2][Ring2][S][Branch1][C][C][C][Ring2][Branch1][Ring2][C][Ring2][Branch1][Branch2][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][=Branch1][=Branch1][O]'
  - '[O][=C][Branch1][#Branch2][C][=C][C][#C][C][#C][C][#C][C][N][C][C][C][=C][C][=C][C][=C][Ring1][=Branch1]'
model-index:
- name: SentenceTransformer
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: NP isotest
      type: NP-isotest
    metrics:
    - type: pearson_cosine
      value: 0.936731178796972
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.93027366634068
      name: Spearman Cosine
    - type: pearson_manhattan
      value: 0.826340669261792
      name: Pearson Manhattan
    - type: spearman_manhattan
      value: 0.845192256146849
      name: Spearman Manhattan
    - type: pearson_euclidean
      value: 0.842726066770598
      name: Pearson Euclidean
    - type: spearman_euclidean
      value: 0.865381289346298
      name: Spearman Euclidean
    - type: pearson_dot
      value: 0.924283770507162
      name: Pearson Dot
    - type: spearman_dot
      value: 0.923230424410894
      name: Spearman Dot
    - type: pearson_max
      value: 0.936731178796972
      name: Pearson Max
    - type: spearman_max
      value: 0.93027366634068
      name: Spearman Max
---

# ChEmbed v0.1 - Chemical Embeddings

This prototype is a [sentence-transformers](https://www.SBERT.net) model based on [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) and fine-tuned on around 1 million pairs of valid natural compounds' SELFIES [(Krenn et al. 2020)](https://github.com/aspuru-guzik-group/selfies) taken from COCONUTDB [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/). It maps compounds' *Self-Referencing Embedded Strings* (SELFIES) into a 768-dimensional dense vector space and can potentially be used for chemical similarity, similarity search, classification, clustering, and more.

I plan to train this model for more epochs on the current dataset before moving on to a larger dataset of 6 million pairs generated from ChEMBL34. However, this will take some time due to computational and financial constraints. A future project of mine is to develop a custom model specifically for cheminformatics, to address any biases and optimization issues that come with repurposing an embedding model designed for NLP tasks.

### Disclaimer: For Academic Purposes Only

The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [MiniLM-L6-H384-uncased](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:** SELFIES pairs generated from COCONUTDB
- **Language:** SELFIES
- **License:** CC BY-NC 4.0

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
)
```
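
The `Pooling` configuration above enables both mean and max pooling over a 384-dimensional hidden state, which would explain the 2 × 384 = 768 output dimensions. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the function name `mean_max_pool`, the toy vectors, and the concatenation order (mean first, then max) are all assumptions made for illustration.

```python
from typing import List


def mean_max_pool(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Concatenate mean pooling and max pooling over the unmasked tokens.

    Hypothetical helper: the order of the two halves is an arbitrary choice
    here and may differ from what sentence-transformers does internally.
    """
    # Drop padding positions (mask value 0) before pooling.
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    mean_part = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    max_part = [max(vec[i] for vec in kept) for i in range(dim)]
    return mean_part + max_part


# Toy input: three tokens with hidden size 4; the last token is padding.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 0.0, 1.0, 2.0],
    [9.0, 9.0, 9.0, 9.0],
]
mask = [1, 1, 0]

pooled = mean_max_pool(tokens, mask)
print(len(pooled))  # 8, i.e. twice the hidden size
```

With a hidden size of 384 instead of 4, the same concatenation gives a 768-dimensional vector.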
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/ChEmbed-v01")

# Run inference on three SELFIES strings
sentences = [
    '[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]',
    '[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][#C][C][=C][C][C][C]',
    '[O][C][C][O][C][Branch2][=Branch2][Ring1][O][C][Branch1][C][C][C][C][C][Branch1][C][O][O][C][C][C][C][C][C][C][C][C][Branch2][Branch1][N][O][C][O][C][Branch1][Ring1][C][O][C][Branch2][Ring2][#Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][O][C][Branch1][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][=Branch2][O][C][Branch1][C][O][C][Ring2][Ring1][#C][O][C][C][C][Ring2][Ring2][#Branch1][Branch1][C][C][C][Ring2][Ring2][N][C][C][C][Ring2][Ring2][S][Branch1][C][C][C][Ring2][Branch1][Ring2][C][Ring2][Branch1][Branch2][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][=Branch1][=Branch1][O]',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
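
To make the similarity step less opaque, here is a minimal pure-Python sketch of cosine similarity, the similarity function listed in the model description, used to rank candidates against a query. The helper names `cosine_similarity` and `rank_by_similarity` and the 2-dimensional toy vectors are assumptions for illustration; real embeddings from this model would have 768 dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: List[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for compound embeddings.
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

ranking = rank_by_similarity(query, candidates)
print([index for index, _ in ranking])  # [1, 2, 0]
```

Because cosine similarity ignores vector length, the candidate `[2.0, 0.0]` scores the same as an exact copy of the query would.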
## Dataset

| Dataset | Reference | Number of Pairs |
|:---------------------------|:-----------|:-----------|
| COCONUTDB (0.8:0.1:0.1 split) | [(Sorokina et al. 2021)](https://coconut.naturalproducts.net/) | 1,183,174 |

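
As a rough check on what a 0.8:0.1:0.1 split means for 1,183,174 pairs, the sketch below computes the partition sizes with integer arithmetic. The rounding convention (round train and validation down, give the remainder to test) is an assumption; the author's actual split procedure is not documented here. The helper name `split_sizes` is hypothetical.

```python
from typing import Tuple


def split_sizes(total: int, train_tenths: int = 8, val_tenths: int = 1) -> Tuple[int, int, int]:
    """Split `total` items into train/validation/test counts.

    Assumed convention: train and validation are rounded down and the
    test partition receives whatever remains.
    """
    train = total * train_tenths // 10
    val = total * val_tenths // 10
    test = total - train - val
    return train, val, test


train_n, val_n, test_n = split_sizes(1_183_174)
print(train_n, val_n, test_n)  # 946539 118317 118318
```

Under this assumed convention, the test partition comes out at 118,318 pairs, which agrees with the number of test pairs reported in the evaluation section.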
## Evaluation

### Metrics

#### Semantic Similarity

* Dataset: `NP-isotest`
* Number of test pairs: 118,318
* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| pearson_cosine      | 0.9367     |
| **spearman_cosine** | **0.9303** |
| pearson_manhattan   | 0.8263     |
| spearman_manhattan  | 0.8452     |
| pearson_euclidean   | 0.8427     |
| spearman_euclidean  | 0.8654     |
| pearson_dot         | 0.9243     |
| spearman_dot        | 0.9232     |
| pearson_max         | 0.9367     |
| spearman_max        | 0.9303     |

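
The metrics above are Pearson and Spearman correlations between the model's similarity scores and the reference scores for each test pair. The sketch below shows how the two coefficients differ, in plain Python on made-up numbers. The function names and the toy scores are assumptions for illustration and are not taken from the evaluation data.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up reference scores and predicted cosine similarities for four pairs.
gold = [0.1, 0.4, 0.5, 0.9]
predicted = [0.2, 0.3, 0.7, 0.6]

print(round(spearman(gold, predicted), 4))  # 0.8
print(round(pearson(gold, predicted), 4))
```

Spearman only looks at the ordering of the scores, so it is the more forgiving of the two when predicted similarities are monotonically but not linearly related to the reference.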
## Limitations

For now, the model may be ineffective at embedding synthetic drugs, since it has so far been trained only on natural products. In addition, the tokenizer has not yet been customized.

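
To illustrate what a SELFIES-aware view of the input could look like, the sketch below splits a SELFIES string into its bracketed symbols with a regular expression. This is not the tokenizer the model currently uses, nor the author's stated plan for a custom one; the helper name `split_symbols` is hypothetical, and the symbol count is not the same as the model's token count.

```python
import re
from typing import List

# One SELFIES symbol is a bracketed unit such as [C] or [#Branch1];
# a bare "." separates disconnected fragments.
SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]|\.")


def split_symbols(selfies: str) -> List[str]:
    """Split a SELFIES string into symbols, rejecting anything left over."""
    symbols = SELFIES_SYMBOL.findall(selfies)
    if "".join(symbols) != selfies:
        raise ValueError("input is not a plain sequence of SELFIES symbols")
    return symbols


example = "[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]"
symbols = split_symbols(example)
print(len(symbols))  # 11
print(symbols[:4])   # ['[O]', '[=C]', '[Branch1]', '[#Branch1]']
```

A general-purpose subword tokenizer may break a symbol such as `[#Branch1]` into several pieces, which is one reason a symbol-level vocabulary is a natural candidate for customization.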
## Testing Generated Embeddings' Clusters

The plots below show how the model's embeddings (at this stage) cluster different classes of compounds, compared with MACCS fingerprints.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/c8_5IWjPgbrGY0Z9-ZHop.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/EHEcaSnra4lldI0LY5tGq.png)

### Framework Versions

- Python: 3.9.13
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu121
- Accelerate: 0.31.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1

## Contact

G Bayu ([email protected])