TSjB/labse-qm

This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
Fine-tuned by Bogdan Tewunalany.
Based on LaBSE.

Usage (Sentence-Transformers)

Python:

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Бу айтым юлгюдю"]

model = SentenceTransformer('TSjB/labse-qm')
embeddings = model.encode(sentences)
print(embeddings)
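
Semantic search over these embeddings comes down to comparing vectors by cosine similarity. The following is a minimal sketch in plain Python, not part of the model's own API: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, corpus):
    """Return (index, score) pairs for the corpus vectors, best match first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumed for illustration only, not real model output).
toy_query = [1.0, 0.0, 0.0]
toy_corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(toy_query, toy_corpus)
print(ranking[0][0])  # index of the closest corpus vector
```

With real output, something like `rank_by_similarity(embeddings[0], embeddings[1:])` should work in the same way, since each row of the returned array can be iterated like a list. For larger corpora, `sentence_transformers.util.cos_sim` is likely a better fit than this pure-Python loop.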

R language:

library(data.table)
library(magrittr)   # provides the %>% pipe used below
library(reticulate)
library(ggplot2)
library(ggrepel)
library(Rtsne)

py_install("sentence-transformers", pip = TRUE)
st <- import("sentence_transformers")

english_sentences = base::c("dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog.")
italian_sentences = base::c("cane", "I cuccioli sono carini.", "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.")
qarachay_sentences = base::c("ит", "Итле джагъымлыдыла.", "Джагъа юсю бла итим бла айланыргъа сюеме.")

model = st$SentenceTransformer('TSjB/labse-qm')

english_embeddings = model$encode(english_sentences)
italian_embeddings = model$encode(italian_sentences)
qarachay_embeddings = model$encode(qarachay_sentences)

m <- rbind(english_embeddings,
           italian_embeddings,
           qarachay_embeddings) %>% as.matrix

tsne <- Rtsne(m, perplexity = floor((nrow(m) - 1) / 3))


tSNE_df <- tsne$Y %>% 
  as.data.table() %>% 
  setnames(old = c("V1", "V2"), new = c("tSNE1", "tSNE2")) %>% 
  .[, `:=`(sentence = c(english_sentences, italian_sentences, qarachay_sentences),
           language = c(rep("english", length(english_sentences)),
                        rep("italian", length(italian_sentences)),
                        rep("qarachay", length(qarachay_sentences))))]


tSNE_df %>%
  ggplot(aes(x = tSNE1,
             y = tSNE2,
             color = language,
             label = sentence)) +
  geom_label_repel() +
  geom_point()

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 6439 with parameters:

{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
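
To make the loss settings concrete, here is a minimal pure-Python sketch of how a multiple-negatives ranking loss with `scale = 20.0` and cosine similarity is commonly computed: for each anchor, the paired sentence is the positive and all other positives in the batch act as negatives, and the loss is the cross-entropy over the scaled similarities. This is an illustration under those assumptions, not the library's implementation; the function names and toy 2-dimensional vectors are hypothetical.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positives[j] in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs (assumed for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each positive matches the other anchor

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The scale acts as an inverse temperature: with cosine similarities bounded in [-1, 1], multiplying by 20 sharpens the softmax so that well-separated pairs contribute almost no loss.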

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 100,
    "evaluator": "__main__.ChainScoreEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "warmupcosine",
    "steps_per_epoch": null,
    "warmup_steps": 1000,
    "weight_decay": 0.01
}
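
The `warmupcosine` scheduler with `warmup_steps = 1000` and `lr = 2e-05` can be sketched as a linear ramp followed by a half-cosine decay. The snippet below is a hedged approximation in plain Python, assuming the total number of steps equals the DataLoader length (6439) times one epoch; `warmup_cosine_lr` is a hypothetical helper, not a library function.

```python
import math

def warmup_cosine_lr(step, base_lr=2e-05, warmup_steps=1000, total_steps=6439):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to base_lr, then a half-cosine
    decay from base_lr down to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 500, 1000, 3000, 6439):
    print(step, warmup_cosine_lr(step))
```

Under these assumptions, roughly the first 15% of training (1000 of 6439 steps) is spent warming up, and the learning rate peaks at step 1000 before decaying for the rest of the epoch.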

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)
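
The layers after the Transformer can be read as a three-step recipe: take the CLS token vector, pass it through a dense layer with a Tanh activation, and L2-normalize the result. The sketch below illustrates that flow in plain Python on toy 2-dimensional inputs; the function name, weights, and token vectors are made up for illustration and are not the model's real parameters.

```python
import math

def sentence_embedding(token_vectors, weight, bias):
    """Toy version of modules (1) to (3): CLS pooling, Dense + Tanh, Normalize."""
    # (1) Pooling with pooling_mode_cls_token: keep only the first token's vector.
    cls = token_vectors[0]
    # (2) Dense layer: one output per weight row, followed by Tanh.
    dense = [math.tanh(sum(w * x for w, x in zip(row, cls)) + b)
             for row, b in zip(weight, bias)]
    # (3) Normalize: scale the vector to unit L2 length.
    norm = math.sqrt(sum(v * v for v in dense))
    return [v / norm for v in dense]

# Hypothetical inputs: two tokens, identity weights, zero bias.
tokens = [[1.0, 2.0], [5.0, 5.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

embedding = sentence_embedding(tokens, identity, zero_bias)
print(sum(v * v for v in embedding))  # squared length of the output vector
```

Because the final step produces unit-length vectors, the dot product of two embeddings equals their cosine similarity.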
Model size: 471M params (Safetensors, tensor type F32)