Fascinating work!

#1
by tomaarsen HF staff - opened

Hello!

Sentence Transformers maintainer here - this is fascinating work! The chemical natural compounds and their notations go way beyond what I'm familiar with, but it looks like the Spearman Cosine similarity is very high, and the t-SNE embeddings look quite nice!

I see that you have some plans to extend this further in the future. I wanted to point you to a potential direction for improvement: the tokenizer.
Each tokenizer splits text differently, and the one that you're using (from MiniLM-L6-H384-uncased) is not aware of the natural compound notations. As a result, it uses multiple tokens to denote something that could probably be represented with just one token, e.g. [C]. See an example here:

image.png

From https://huggingface.co/spaces/Xenova/the-tokenizer-playground

In short: it might make sense to 1) take an existing tokenizer trained on the chemical compounds or 2) train one yourself.
Do note that you'd likely not be able to use a pretrained model with your custom tokenizer, so you would have to train from random weights. With a much smaller vocabulary, I suspect you'll also get higher throughput and faster training.
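
To make option 2 a bit more concrete, here is a minimal sketch of a symbol-level tokenizer in plain Python. It assumes the molecules are written in a SELFIES-like notation where every symbol sits in square brackets; the function names, the special tokens and the two toy strings are made up for illustration and are not taken from any existing tokenizer. A real setup would more likely go through a tokenizer library, but the idea is the same: one bracketed symbol becomes one token.

```python
import re

# Matches one bracketed symbol such as [C], [=O] or [Branch1].
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]+\]")

# Hypothetical special tokens, chosen to mirror BERT-style conventions.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def split_symbols(selfies):
    """Split a SELFIES-like string into its bracketed symbols."""
    return SYMBOL_PATTERN.findall(selfies)


def build_vocab(corpus):
    """Give every distinct symbol its own id, after the special tokens."""
    vocab = {token: index for index, token in enumerate(SPECIAL_TOKENS)}
    for selfies in corpus:
        for symbol in split_symbols(selfies):
            vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(selfies, vocab):
    """Map a string to token ids, using [UNK] for unseen symbols."""
    unk_id = vocab["[UNK]"]
    ids = [vocab.get(symbol, unk_id) for symbol in split_symbols(selfies)]
    return [vocab["[CLS]"], *ids, vocab["[SEP]"]]


# Toy corpus, only meant to show the mechanics.
corpus = ["[C][C][=O]", "[C][=C][Branch1][C][O]"]
vocab = build_vocab(corpus)
token_ids = encode("[C][C][=O]", vocab)
```

With this kind of scheme the vocabulary only holds the symbols that actually occur in the data, so it ends up far smaller than a general-purpose English vocabulary.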

Anyways, you're free to go this route or continue finetuning "ready to go" embedding models like MiniLM-L6-H384-uncased: clearly it's also working well.

  • Tom Aarsen

Hello!

Thank you so much for your feedback, I appreciate your recommendations a lot. Currently, I am trying to either adapt zpn's SELFIES tokenizer or train a custom tokenizer for this. In chemistry, molecules are usually represented with SMILES, which is known to be a bit messy to train a model on, and SELFIES seems better due to its consistency. I plan to start testing them shortly, and will then train a base model from randomly initialized weights with a reduced vocabulary size.

Thanks again for taking the time to engage with my work and for pointing me in this direction. I am relatively new to ML/AI, so I am excited to see the results!

  • G Bayu

Excellent! I think you're heading in the right direction then!
Your work reminds me somewhat of the Protein Similarity and Matryoshka Embeddings blogpost by @monsoon-nlp from a few months ago, except with proteins instead. He also used Matryoshka Embeddings (blogpost, documentation) in case that strikes your fancy. In short: Matryoshka Embeddings can be truncated on the fly with minor loss in performance, allowing for faster retrieval/clustering. This can be quite nice when your use case deals with a lot of data.
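
As a rough illustration of what "truncated on the fly" means, here is a small sketch in plain Python: keep the first few dimensions of an embedding, re-normalize, and compare with cosine similarity as usual. The helper names and the toy vectors are made up for illustration and are not real model output. Note that this only preserves quality when the model was trained with a Matryoshka-style objective; truncating the embeddings of an ordinary model generally loses more.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(value * value for value in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [value / norm for value in head]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 6-dimensional "embeddings", invented for this example.
molecule_a = [0.9, 0.3, 0.2, 0.1, 0.05, 0.02]
molecule_b = [0.8, 0.4, 0.1, 0.2, 0.01, 0.03]

full_score = cosine_similarity(molecule_a, molecule_b)

# Same comparison using only the first 3 dimensions.
short_a = truncate_and_normalize(molecule_a, 3)
short_b = truncate_and_normalize(molecule_b, 3)
short_score = cosine_similarity(short_a, short_b)
```

The shorter vectors take less memory and make each similarity computation cheaper, which is where the faster retrieval and clustering come from.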

  • Tom Aarsen

I didn't know about Matryoshka, but after reading both blogs a bit, I agree it would be nice for dealing with large chemical databases. I will read those blogs again and try experimenting with them once training the base model with the custom tokenizer works well enough. Again, thank you!

  • G Bayu
gbyuvd changed discussion status to closed
