Base BERT for Semantic Text Similarity (STS) on GPU

A high-quality BERT model for computing sentence embeddings in Russian. The model is based on cointegrated/LaBSE-en-ru and has the same context size (512), embedding dimension (768), and speed.

Using the model with the transformers library:

# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment it if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
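
Because embed_bert_cls L2-normalizes its output, the dot product of two returned embeddings equals their cosine similarity. The sketch below illustrates that identity in pure Python; the 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for model embeddings (hypothetical values).
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

# After normalization, a plain dot product gives the cosine similarity.
similarity = dot(l2_normalize(u), l2_normalize(v))
print(similarity, cosine(u, v))
```

With real embeddings, the same comparison could be made by taking the dot product of two embed_bert_cls outputs.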

Using the model with sentence_transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
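
The printed result is a square matrix of pairwise scores, one row and one column per sentence. As a minimal sketch of how such a matrix might be read, the function below picks, for each sentence, the most similar other sentence. The matrix values are invented for illustration and are not real model output; nearest_neighbors is a hypothetical helper.

```python
def nearest_neighbors(scores):
    """For each row i, return the index j != i with the highest score."""
    result = []
    for i, row in enumerate(scores):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result

# Hypothetical 3x3 similarity matrix; NOT actual output of the model.
scores = [
    [1.00, 0.80, 0.60],
    [0.80, 1.00, 0.50],
    [0.60, 0.50, 1.00],
]

neighbors = nearest_neighbors(scores)
print(neighbors)
```

To apply this to the real scores, the tensor returned by util.dot_score can be converted to nested lists with its tolist() method.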

Metrics

Model scores on the encodechka benchmark:

| Model | STS | PI | NLI | SA | TI |
|---|---|---|---|---|---|
| intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
| sergeyzh/LaBSE-ru-sts | 0.845 | 0.737 | 0.481 | 0.805 | 0.957 |
| sergeyzh/rubert-mini-sts | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
| sergeyzh/rubert-tiny-sts | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
| Tochka-AI/ruRoPEBert-e5-base-512 | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
| cointegrated/LaBSE-en-ru | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
| cointegrated/rubert-tiny2 | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |

Tasks:

  • Semantic text similarity (STS);
  • Paraphrase identification (PI);
  • Natural language inference (NLI);
  • Sentiment analysis (SA);
  • Toxicity identification (TI).
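
For a quick comparison across the five tasks, the per-task scores can be collapsed into an unweighted mean. This is only an illustrative summary and not an official encodechka score; the numbers are copied from the table above for three of the models.

```python
# Scores copied from the encodechka table, in the order STS, PI, NLI, SA, TI.
scores = {
    "intfloat/multilingual-e5-large": [0.862, 0.727, 0.473, 0.810, 0.979],
    "sergeyzh/LaBSE-ru-sts": [0.845, 0.737, 0.481, 0.805, 0.957],
    "cointegrated/LaBSE-en-ru": [0.794, 0.659, 0.431, 0.761, 0.946],
}

def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

means = {name: mean(vals) for name, vals in scores.items()}

# Difference between the fine-tuned model and the base model it was built on.
gain_over_base = means["sergeyzh/LaBSE-ru-sts"] - means["cointegrated/LaBSE-en-ru"]

for name, value in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```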

Speed and size

Model speed and size figures from the encodechka benchmark:

| Model | CPU | GPU | size | dim | n_ctx | n_vocab |
|---|---|---|---|---|---|---|
| intfloat/multilingual-e5-large | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
| sergeyzh/LaBSE-ru-sts | 42.835 | 8.561 | 490 | 768 | 512 | 55083 |
| sergeyzh/rubert-mini-sts | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
| sergeyzh/rubert-tiny-sts | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
| Tochka-AI/ruRoPEBert-e5-base-512 | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
| cointegrated/LaBSE-en-ru | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
| cointegrated/rubert-tiny2 | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
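
The table does not say how the CPU and GPU timings were obtained. As a rough sketch of how a per-sentence latency could be measured, the function below times repeated single-sentence calls and reports milliseconds per sentence. The millisecond unit, the warm-up call, and the repeat count are assumptions of this sketch, not the benchmark's actual procedure; mean_latency_ms and dummy_encode are hypothetical names.

```python
import time

def mean_latency_ms(encode, sentences, repeats=3):
    """Average wall-clock time per sentence in milliseconds, one sentence per call."""
    encode(sentences[0])  # warm-up call so one-time setup cost is not measured
    start = time.perf_counter()
    for _ in range(repeats):
        for sentence in sentences:
            encode(sentence)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(sentences))

# Stand-in encoder so the sketch runs without downloading a model (hypothetical).
def dummy_encode(sentence):
    return [float(len(sentence))]

latency = mean_latency_ms(dummy_encode, ["привет мир", "hello world"])
print(f"{latency:.4f} ms per sentence")
```

To time the real model, pass a callable such as `lambda s: embed_bert_cls(s, model, tokenizer)`. On a GPU, the `.cpu()` call inside embed_bert_cls waits for the computation to finish, so the wall-clock time includes the full forward pass.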

Model scores on the ruMTEB benchmark:

| Model Name | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|---|---|---|---|---|---|---|---|---|
| CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | 0.448 |
| GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | 0.497 |
| GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | 0.644 | 0.586 | 0.545 | 0.605 |
| HeadlineClassification | Accuracy | 0.772 | 0.793 | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
| InappropriatenessClassification | Accuracy | 0.646 | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
| KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | 0.566 |
| RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | 0.807 |
| RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
| RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | 0.741 |
| RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | 0.653 |
| RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
| RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | 0.582 |
| RuSciBenchGRNTIClusteringP2P | V-measure | 0.522 | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
| RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | 0.445 |
| RuSciBenchOECDClusteringP2P | V-measure | 0.473 | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
| SensitiveTopicsClassification | Accuracy | 0.285 | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
| TERRaClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |

| Model Name | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | LaBSE-ru-turbo | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|---|---|---|---|---|---|---|---|---|
| Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | 0.588 |
| Clustering | V-measure | 0.526 | 0.519 | 0.513 | 0.538 | 0.513 | 0.503 | 0.525 |
| MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | 0.361 | 0.314 | 0.329 | 0.353 |
| PairClassification | Average Precision | 0.520 | 0.502 | 0.587 | 0.585 | 0.551 | 0.550 | 0.584 |
| Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | 0.756 |
| Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | 0.774 |
| STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | 0.831 |
| Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | 0.630 |
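
The Average row appears to be the unweighted mean of the seven task-type rows. That reading is inferred from the numbers rather than stated in the source. A short check for the LaBSE-ru-sts column, using values copied from the summary table:

```python
# Task-type scores for LaBSE-ru-sts, copied from the ruMTEB summary table.
labse_ru_sts = {
    "Classification": 0.524,
    "Clustering": 0.513,
    "MultiLabelClassification": 0.340,
    "PairClassification": 0.587,
    "Reranking": 0.688,
    "Retrieval": 0.637,
    "STS": 0.788,
}

# Unweighted mean over the seven task types.
average = sum(labse_ru_sts.values()) / len(labse_ru_sts)
print(round(average, 3))
```

The rounded result matches the 0.582 reported in the Average row for this model.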

Model size: 129M params (Safetensors, tensor type F32)