---
language:
- ru
pipeline_tag: sentence-similarity
tags:
- russian
- pretraining
- embeddings
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
license: mit
base_model: cointegrated/LaBSE-en-ru
---

## Base BERT for Semantic Text Similarity (STS) on GPU

A high-quality BERT model for computing sentence embeddings in Russian. The model is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512), embedding size (768), and inference speed.

## Using the model with the `transformers` library:

```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment it if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    # Tokenize, truncating to the model's maximum context length
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
```
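
Because `embed_bert_cls` L2-normalizes its output, the dot product of two returned vectors equals their cosine similarity. The sketch below illustrates that identity with toy 3-dimensional vectors. The values and the helper names `l2_normalize`, `dot`, and `cosine` are hypothetical and chosen for illustration; they are not real model outputs or part of any library. It uses only the standard library so it runs without downloading the model.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for embeddings (hypothetical values)
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

u_n, v_n = l2_normalize(u), l2_normalize(v)

# Both numbers agree: dot product after normalization == cosine before it
print(round(dot(u_n, v_n), 6), round(cosine(u, v), 6))
```

In practice this means two vectors returned by `embed_bert_cls` can be compared with a single dot product, with no further normalization step.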

## Using with `sentence_transformers`:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
# Pairwise dot-product scores between all sentences
print(util.dot_score(embeddings, embeddings))
```
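
`util.dot_score` returns a square matrix of pairwise scores, so a common next step is to pick, for each sentence, the closest other sentence. The sketch below shows one way to do that on a plain nested list. The helper `most_similar` and the score values are hypothetical and are NOT real model output.

```python
def most_similar(scores):
    """For each row, return the index of the highest-scoring *other* item."""
    result = []
    for i, row in enumerate(scores):
        # Skip the diagonal: every sentence is trivially closest to itself
        candidates = [(s, j) for j, s in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result

# Hypothetical 3x3 score matrix (illustrative values, not model output)
scores = [
    [1.00, 0.80, 0.60],
    [0.80, 1.00, 0.50],
    [0.60, 0.50, 1.00],
]

print(most_similar(scores))  # [1, 0, 0]
```

The tensor returned by `util.dot_score` can be converted to such a nested list with `.tolist()` before passing it to a helper like this.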

## Metrics

Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model                            |    STS    |    PI     |    NLI    |    SA     |    TI     |
|:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)   |   0.862   |   0.727   |   0.473   |   0.810   |   0.979   |
| **sergeyzh/LaBSE-ru-sts**  |   0.845   |   0.737   |   0.481   |   0.805   |   0.957   |
| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts)         |   0.815   |   0.723   |   0.477   |   0.791   |   0.949   |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts)         |   0.797   |   0.702   |   0.453   |   0.778   |   0.946   |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) |   0.793   |   0.704   |   0.457   |   0.803   |   0.970   |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru)         |   0.794   |   0.659   |   0.431   |   0.761   |   0.946   |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)        |   0.750   |   0.651   |   0.417   |   0.737   |   0.937   |

**Tasks:**

- Semantic text similarity (**STS**);
- Paraphrase identification (**PI**);
- Natural language inference (**NLI**);
- Sentiment analysis (**SA**);
- Toxicity identification (**TI**).
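
To see how the fine-tuned model compares with its base model on each task, the per-task difference can be computed directly from the encodechka table. The sketch below copies the two relevant rows verbatim from that table; the variable names are illustrative.

```python
tasks = ["STS", "PI", "NLI", "SA", "TI"]

# Rows copied from the encodechka table in this model card
labse_ru_sts = [0.845, 0.737, 0.481, 0.805, 0.957]  # sergeyzh/LaBSE-ru-sts
labse_en_ru = [0.794, 0.659, 0.431, 0.761, 0.946]   # cointegrated/LaBSE-en-ru (base model)

# Absolute gain of the fine-tuned model over the base model, per task
gains = {t: round(a - b, 3) for t, a, b in zip(tasks, labse_ru_sts, labse_en_ru)}

for task, gain in gains.items():
    print(f"{task}: +{gain:.3f}")
```

On these numbers the gain is largest for PI (0.078) and smallest for TI (0.011).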

## Speed and size

Model measurements on the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model                            |       CPU |       GPU |      size |       dim |     n_ctx |   n_vocab |
|:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)   |   149.026 |    15.629 |      2136 |      1024 |       514 |    250002 |
| **sergeyzh/LaBSE-ru-sts**  |    42.835 |     8.561 |       490 |       768 |       512 |     55083 |
| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts)         |     6.417 |     5.517 |       123 |       312 |      2048 |     83828 |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts)         |     3.208 |     3.379 |       111 |       312 |      2048 |     83828 |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) |    43.314 |     9.338 |       532 |       768 |       512 |     69382 |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru)         |    42.867 |     8.549 |       490 |       768 |       512 |     55083 |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)        |     3.212 |     3.384 |       111 |       312 |      2048 |     83828 |

Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:

|Model Name                         | Metric              | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|---------------------:|----------------------:|
|CEDRClassification                 | Accuracy            |                  0.368 |               0.358 |           0.418 |         **0.451** |                 0.401 |                0.423 |                 0.448 |
|GeoreviewClassification            | Accuracy            |                  0.397 |               0.400 |           0.406 |             0.438 |                 0.447 |                0.461 |             **0.497** |
|GeoreviewClusteringP2P             | V-measure           |                  0.584 |               0.590 |           0.626 |         **0.644** |                 0.586 |                0.545 |                 0.605 |
|HeadlineClassification             | Accuracy            |                  0.772 |           **0.793** |           0.633 |             0.688 |                 0.732 |                0.757 |                 0.758 |
|InappropriatenessClassification    | Accuracy            |              **0.646** |               0.625 |           0.599 |             0.615 |                 0.592 |                0.588 |                 0.616 |
|KinopoiskClassification            | Accuracy            |                  0.503 |               0.495 |           0.496 |             0.521 |                 0.500 |                0.509 |             **0.566** |
|RiaNewsRetrieval                   | NDCG@10             |                  0.214 |               0.111 |           0.651 |             0.694 |                 0.700 |                0.702 |             **0.807** |
|RuBQReranking                      | MAP@10              |                  0.561 |               0.468 |           0.688 |             0.687 |                 0.715 |                0.720 |             **0.756** |
|RuBQRetrieval                      | NDCG@10             |                  0.298 |               0.124 |           0.622 |             0.657 |                 0.685 |                0.696 |             **0.741** |
|RuReviewsClassification            | Accuracy            |                  0.589 |               0.583 |           0.599 |             0.632 |                 0.612 |                0.630 |             **0.653** |
|RuSTSBenchmarkSTS                  | Pearson correlation |                  0.712 |               0.588 |           0.788 |             0.822 |                 0.781 |                0.796 |             **0.831** |
|RuSciBenchGRNTIClassification      | Accuracy            |                  0.542 |               0.539 |           0.529 |             0.569 |                 0.550 |                0.563 |             **0.582** |
|RuSciBenchGRNTIClusteringP2P       | V-measure           |              **0.522** |               0.504 |           0.486 |             0.517 |                 0.511 |                0.516 |                 0.520 |
|RuSciBenchOECDClassification       | Accuracy            |                  0.438 |               0.430 |           0.406 |             0.440 |                 0.427 |                0.423 |             **0.445** |
|RuSciBenchOECDClusteringP2P        | V-measure           |              **0.473** |               0.464 |           0.426 |             0.452 |                 0.443 |                0.448 |                 0.450 |
|SensitiveTopicsClassification      | Accuracy            |              **0.285** |               0.280 |           0.262 |             0.272 |                 0.228 |                0.234 |                 0.257 |
|TERRaClassification                | Average Precision   |                  0.520 |               0.502 |       **0.587** |             0.585 |                 0.551 |                0.550 |                 0.584 |

ruMTEB scores averaged by task type:

|Model Name                         | Metric              | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|----------------------:|---------------------:|
|Classification                     | Accuracy            |                  0.554 |               0.552 |           0.524 |             0.558 |                 0.551 |                 0.561 |            **0.588** |
|Clustering                         | V-measure           |                  0.526 |               0.519 |           0.513 |         **0.538** |                 0.513 |                 0.503 |                0.525 |
|MultiLabelClassification           | Accuracy            |                  0.326 |               0.319 |           0.340 |         **0.361** |                 0.314 |                 0.329 |                0.353 |
|PairClassification                 | Average Precision   |                  0.520 |               0.502 |       **0.587** |             0.585 |                 0.551 |                 0.550 |                0.584 |
|Reranking                          | MAP@10              |                  0.561 |               0.468 |           0.688 |             0.687 |                 0.715 |                 0.720 |            **0.756** |
|Retrieval                          | NDCG@10             |                  0.256 |               0.118 |           0.637 |             0.675 |                 0.697 |                 0.699 |            **0.774** |
|STS                                | Pearson correlation |                  0.712 |               0.588 |           0.788 |             0.822 |                 0.781 |                 0.796 |            **0.831** |
|Average                            | Average             |                  0.494 |               0.438 |           0.582 |             0.604 |                 0.588 |                 0.594 |            **0.630** |
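
The `Average` row appears to be the unweighted mean of the seven task-type rows above it. This is an inference from the numbers, not something the table states. The sketch below checks it for two columns, using values copied from the table; the variable names are illustrative.

```python
from statistics import mean

# Task-type scores copied from the table, in row order:
# Classification, Clustering, MultiLabelClassification, PairClassification,
# Reranking, Retrieval, STS
category_scores = {
    "LaBSE-ru-sts": [0.524, 0.513, 0.340, 0.587, 0.688, 0.637, 0.788],
    "multilingual-e5-large": [0.588, 0.525, 0.353, 0.584, 0.756, 0.774, 0.831],
}

# Unweighted mean over task types, rounded to three decimals as in the table
averages = {name: round(mean(scores), 3) for name, scores in category_scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
```

Both results match the reported `Average` values (0.582 and 0.630), which is consistent with every task type carrying equal weight regardless of how many datasets it contains.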