ChEmbed v0.1 - Chemical Embeddings
This prototype is a sentence-transformers model based on MiniLM-L6-H384-uncased, fine-tuned on around 1 million pairs of valid natural compounds' SELFIES (Krenn et al. 2020) taken from COCONUTDB (Sorokina et al. 2021). It maps compounds' Self-Referencing Embedded Strings (SELFIES) to a 768-dimensional dense vector space and can potentially be used for chemical similarity estimation, similarity search, classification, clustering, and more.
I am planning to train this model for more epochs on the current dataset before moving on to a larger dataset of 6 million pairs generated from ChEMBL34. However, this will take some time due to computational and financial constraints. A future project of mine is to develop a custom model specifically for cheminformatics, to address the biases and optimization issues that come with repurposing an embedding model designed for NLP tasks.
Update
This model won't be trained further on the current natural products dataset nor on ChEMBL34, since for the past two weeks I have been working on pre-training a BERT-like base model that operates on SELFIES with a custom tokenizer. That base model was scheduled for release this week, but due to mistakes in parsing some SELFIES notations, pre-training has been halted while I work to correct these issues and resume training. The base model will hopefully be released next week. Following this, I plan to fine-tune a sentence transformer and a classifier model built on top of that base model. The timeline for these tasks depends on the availability of the compute server and my own time constraints, as I also need to finish my undergraduate thesis. Thank you for checking out this model.
The base model is now available here. A new version of this model is now available here.
Disclaimer: For Academic Purposes Only
The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: MiniLM-L6-H384-uncased
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions (the 384-dimensional token embeddings are mean- and max-pooled, and the two pooled vectors are concatenated)
- Similarity Function: Cosine Similarity
- Training Dataset: SELFIES pairs generated from COCONUTDB
- Language: SELFIES
- License: CC BY-NC 4.0
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
)
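Because both mean- and max-pooling are enabled in the Pooling module above, the 384-dimensional token embeddings are pooled twice and concatenated into 768-dimensional sentence embeddings. A minimal sketch to check this, assuming the checkpoint ID from the Usage section below:

```python
from sentence_transformers import SentenceTransformer

# Load the published checkpoint (same ID as in the Usage section)
model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Mean- and max-pooled 384-d vectors concatenated -> 768-d sentence embeddings
print(model.get_sentence_embedding_dimension())  # expected: 768
```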
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/ChemEmbed-v01")
# Run inference
sentences = [
'[O][=C][Branch1][#Branch1][C][=C][C][C][C][=C][C]',
'[O][=C][Branch1][C][O][C][C][C][C][C][C][C][C][Branch1][C][O][C][=C][C][#C][C][=C][C][C][C]',
'[O][C][C][O][C][Branch2][=Branch2][Ring1][O][C][Branch1][C][C][C][C][C][Branch1][C][O][O][C][C][C][C][C][C][C][C][C][Branch2][Branch1][N][O][C][O][C][Branch1][Ring1][C][O][C][Branch2][Ring2][#Branch1][O][C][O][C][Branch1][Ring1][C][O][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][#Branch2][O][C][O][C][Branch1][C][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring1][=Branch2][O][C][Branch1][C][O][C][Ring2][Ring1][#C][O][C][C][C][Ring2][Ring2][#Branch1][Branch1][C][C][C][Ring2][Ring2][N][C][C][C][Ring2][Ring2][S][Branch1][C][C][C][Ring2][Branch1][Ring2][C][Ring2][Branch1][Branch2][C][C][Branch1][C][O][C][Branch1][C][O][C][Ring2][=Branch1][=Branch1][O]',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
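The model expects SELFIES strings as input. If your data is in SMILES, the sketch below shows one way to convert it with the selfies package (Krenn et al. 2020) and run a small top-k similarity search using sentence-transformers' util.semantic_search. The example SMILES and corpus are placeholders for illustration, not part of this model's training data.

```python
# pip install selfies
import selfies as sf
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Placeholder SMILES for illustration only
corpus_smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",          # aspirin
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",   # caffeine
    "C1=CC=C(C=C1)C=O",               # benzaldehyde
]
query_smiles = "CC(=O)Nc1ccc(O)cc1"   # paracetamol

# Convert SMILES to SELFIES before encoding
corpus_selfies = [sf.encoder(s) for s in corpus_smiles]
query_selfies = sf.encoder(query_smiles)

corpus_emb = model.encode(corpus_selfies, convert_to_tensor=True)
query_emb = model.encode(query_selfies, convert_to_tensor=True)

# Retrieve the most similar corpus compounds by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus_smiles[hit["corpus_id"]], round(hit["score"], 3))
```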
Dataset
Dataset | Reference | Number of Pairs |
---|---|---|
COCONUTDB (0.8:0.1:0.1 split) | (Sorokina et al. 2021) | 1,183,174 |
Evaluation
Metrics
Semantic Similarity
- Dataset: NP-isotest
- Number of test pairs: 118,318
- Evaluated with EmbeddingSimilarityEvaluator
Metric | Value |
---|---|
pearson_cosine | 0.9367 |
spearman_cosine | 0.9303 |
pearson_manhattan | 0.8263 |
spearman_manhattan | 0.8452 |
pearson_euclidean | 0.8654 |
spearman_euclidean | 0.9243 |
pearson_dot | 0.9232 |
spearman_dot | 0.9367 |
pearson_max | 0.9303 |
spearman_max | 0.8961 |
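A minimal sketch of how an evaluation like the one above could be run with sentence-transformers' EmbeddingSimilarityEvaluator; the SELFIES pairs and gold similarity scores below are placeholders, since the NP-isotest split itself is not reproduced here.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("gbyuvd/ChemEmbed-v01")

# Placeholder test pairs and gold similarity scores (replace with the NP-isotest split)
selfies_a = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]
selfies_b = ["[C][C][C][O]", "[C][C][=C][C][=C][C][=C][Ring1][=Branch1]"]
gold_scores = [0.9, 0.8]

evaluator = EmbeddingSimilarityEvaluator(selfies_a, selfies_b, gold_scores, name="NP-isotest")
results = evaluator(model)  # Pearson/Spearman correlations for cosine, Euclidean, Manhattan, dot
print(results)
```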
Limitations
For now, the model might be ineffective at embedding synthetic drugs, since it has only been trained on natural products. Also, the tokenizer is still the uncustomized one inherited from the NLP base model rather than one built for SELFIES.
Testing Generated Embeddings' Clusters
The plot below shows how the model's embeddings (at this stage) cluster different classes of compounds, compared to using MACCS fingerprints.
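A hedged sketch of how such a comparison could be reproduced, assuming the selfies, RDKit, scikit-learn, and matplotlib packages are installed; the SELFIES list and class labels below are placeholders rather than the exact compounds used for the plot.

```python
import numpy as np
import matplotlib.pyplot as plt
import selfies as sf
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Placeholder inputs: SELFIES strings and their (hypothetical) compound classes
selfies_list = [
    "[C][C][O]",                                   # small alcohols
    "[C][C][C][O]",
    "[C][=C][C][=C][C][=C][Ring1][=Branch1]",      # simple aromatics
    "[C][C][=C][C][=C][C][=C][Ring1][=Branch1]",
]
labels = ["alcohol", "alcohol", "aromatic", "aromatic"]

# 1) Model embeddings computed directly from SELFIES
model = SentenceTransformer("gbyuvd/ChemEmbed-v01")
emb = model.encode(selfies_list)

# 2) MACCS fingerprints, via SELFIES -> SMILES -> RDKit Mol
mols = [Chem.MolFromSmiles(sf.decoder(s)) for s in selfies_list]
maccs = np.array([list(MACCSkeys.GenMACCSKeys(m)) for m in mols])

# Project both representations to 2D and plot side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, X, title in zip(axes, [emb, maccs], ["ChemEmbed embeddings", "MACCS fingerprints"]):
    xy = PCA(n_components=2).fit_transform(X)
    for lab in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == lab]
        ax.scatter(xy[idx, 0], xy[idx, 1], label=lab)
    ax.set_title(title)
    ax.legend()
plt.show()
```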
Framework Versions
- Python: 3.9.13
- Sentence Transformers: 3.0.1
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu121
- Accelerate: 0.31.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
Contact
G Bayu ([email protected])