ChEmbed v0.1 - Chemical Embeddings
This prototype is a sentence-transformers model based on MiniLM-L6-H384-uncased, fine-tuned on around 1 million pairs of valid natural compounds' SELFIES (Krenn et al. 2020) taken from COCONUTDB (Sorokina et al. 2021). It maps compounds' Self-Referencing Embedded Strings (SELFIES) to a 768-dimensional dense vector space and can potentially be used for chemical similarity estimation, similarity search, classification, clustering, and more.
I am planning to train this model for more epochs on the current dataset before moving on to a larger dataset of 6 million pairs generated from ChEMBL34. However, this will take some time due to computational and financial constraints. A future project of mine is to develop a custom model specifically for cheminformatics, to address the biases and optimization issues that come with repurposing an embedding model designed for NLP tasks.
Update
This model won't be trained further on the current natural products dataset nor on ChEMBL34, since for the past two weeks I have been working on pre-training a BERT-like base model that operates on SELFIES with a custom tokenizer. That base model was scheduled for release this week, but due to mistakes in parsing some SELFIES notations, pre-training has been halted while I work to correct these issues and resume training. The base model will hopefully be released next week. Following this, I plan to fine-tune a sentence transformer and a classifier model built on top of that base model. The timeline for these tasks depends on the availability of the compute server and my own time constraints, as I also need to finish my undergraduate thesis. Thank you for checking out this model.
The base model is now available here. A new version of this model is now available here.
Disclaimer: For Academic Purposes Only
The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: MiniLM-L6-H384-uncased
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions (the 384-dimensional token embeddings are mean- and max-pooled, and the two pooled vectors are concatenated)
- Similarity Function: Cosine Similarity
- Training Dataset: SELFIES pairs generated from COCONUTDB
- Language: SELFIES
- License: CC BY-NC 4.0
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
)
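Because both mean- and max-pooling are enabled in the Pooling module above, the 384-dimensional token embeddings are pooled twice and concatenated into 768-dimensional sentence embeddings. A minimal sketch to check this, assuming the checkpoint ID from the Usage section below:

```python
from sentence_transformers import SentenceTransformer

# Load the published checkpoint (same ID as in the Usage section)
model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Mean- and max-pooled 384-d vectors concatenated -> 768-d sentence embeddings
print(model.get_sentence_embedding_dimension())  # expected: 768
```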
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/ChemEmbed-v01")
# Run inference
sentences = [
'[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]',
'[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][#C][C][=C][C][C][C]',
'[O][C][C][O][C][Branch2][=Branch2][Ring1][O][C][Branch1][C][C][C][C][C][Branch1][C][O][O][C][C][C][C][C][C][C][C][C][Branch2][Branch1][N][O][C][O][C][Branch1][Ring1][C][O][C][Branch2][Ring2][#Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][O][C][Branch1][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][=Branch2][O][C][Branch1][C][O][C][Ring2][Ring1][#C][O][C][C][C][Ring2][Ring2][#Branch1][Branch1][C][C][C][Ring2][Ring2][N][C][C][C][Ring2][Ring2][S][Branch1][C][C][C][Ring2][Branch1][Ring2][C][Ring2][Branch1][Branch2][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][=Branch1][=Branch1][O]',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
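The model expects SELFIES strings as input. If your data is in SMILES, the sketch below shows one way to convert it with the selfies package (Krenn et al. 2020) and run a small top-k similarity search using sentence-transformers' util.semantic_search. The example SMILES and corpus are placeholders for illustration, not part of this model's training data.

```python
# pip install selfies
import selfies as sf
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Placeholder SMILES for illustration only
corpus_smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",          # aspirin
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",   # caffeine
    "C1=CC=C(C=C1)C=O",               # benzaldehyde
]
query_smiles = "CC(=O)Nc1ccc(O)cc1"   # paracetamol

# Convert SMILES to SELFIES before encoding
corpus_selfies = [sf.encoder(s) for s in corpus_smiles]
query_selfies = sf.encoder(query_smiles)

corpus_emb = model.encode(corpus_selfies, convert_to_tensor=True)
query_emb = model.encode(query_selfies, convert_to_tensor=True)

# Retrieve the most similar corpus compounds by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus_smiles[hit["corpus_id"]], round(hit["score"], 3))
```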
Dataset
Dataset | Reference | Number of Pairs |
---|---|---|
COCONUTDB (0.8:0.1:0.1 split) | (Sorokina et al. 2021) | 1,183,174 |
Evaluation
Metrics
Semantic Similarity
- Dataset: NP-isotest
- Number of test pairs: 118,318
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.9367 |
spearman_cosine | 0.9303 |
pearson_manhattan | 0.8263 |
spearman_manhattan | 0.8452 |
pearson_euclidean | 0.8654 |
spearman_euclidean | 0.9243 |
pearson_dot | 0.9232 |
spearman_dot | 0.9367 |
pearson_max | 0.9303 |
spearman_max | 0.8961 |
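A minimal sketch of how an evaluation like the one above could be run with sentence-transformers' EmbeddingSimilarityEvaluator; the SELFIES pairs and gold similarity scores below are placeholders, since the NP-isotest split itself is not reproduced here.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Placeholder test pairs and gold similarity scores (replace with the NP-isotest split)
selfies_a = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]
selfies_b = ["[C][C][C][O]", "[C][C][=C][C][=C][C][=C][Ring1][=Branch1]"]
gold_scores = [0.9, 0.8]

evaluator = EmbeddingSimilarityEvaluator(selfies_a, selfies_b, gold_scores, name="NP-isotest")
results = evaluator(model)  # Pearson/Spearman correlations for cosine, Euclidean, Manhattan, dot
print(results)
```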
Limitations
For now, the model might be ineffective at embedding synthetic drugs, since it has only been trained on natural products. Also, the tokenizer is still the uncustomized one inherited from the NLP base model rather than one built for SELFIES.
Testing Generated Embeddings' Clusters
The plot below shows how the model's embeddings (at this stage) cluster different classes of compounds, compared to using MACCS fingerprints.
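A hedged sketch of how such a comparison could be reproduced, assuming the selfies, RDKit, scikit-learn, and matplotlib packages are installed; the SELFIES list and class labels below are placeholders rather than the exact compounds used for the plot.

```python
import numpy as np
import matplotlib.pyplot as plt
import selfies as sf
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Placeholder inputs: SELFIES strings and their (hypothetical) compound classes
selfies_list = [
    "[C][C][O]",                                   # small alcohols
    "[C][C][C][O]",
    "[C][=C][C][=C][C][=C][Ring1][=Branch1]",      # simple aromatics
    "[C][C][=C][C][=C][C][=C][Ring1][=Branch1]",
]
labels = ["alcohol", "alcohol", "aromatic", "aromatic"]

# 1) Model embeddings computed directly from SELFIES
model = SentenceTransformer("gbyuvd/ChemEmbed-v01")
emb = model.encode(selfies_list)

# 2) MACCS fingerprints, via SELFIES -> SMILES -> RDKit Mol
mols = [Chem.MolFromSmiles(sf.decoder(s)) for s in selfies_list]
maccs = np.array([list(MACCSkeys.GenMACCSKeys(m)) for m in mols])

# Project both representations to 2D and plot side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, X, title in zip(axes, [emb, maccs], ["ChemEmbed embeddings", "MACCS fingerprints"]):
    xy = PCA(n_components=2).fit_transform(X)
    for lab in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == lab]
        ax.scatter(xy[idx, 0], xy[idx, 1], label=lab)
    ax.set_title(title)
    ax.legend()
plt.show()
```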
Framework Versions
- Python: 3.9.13
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu121
- Accelerate: 0.31.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
Contact
G Bayu ([email protected])