---
language:
- ru
pipeline_tag: sentence-similarity
tags:
- russian
- pretraining
- embeddings
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
license: mit
base_model: cointegrated/LaBSE-en-ru
---
## Base BERT for Semantic Text Similarity (STS) on GPU
A high-quality BERT model for computing sentence embeddings in Russian. The model is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context size (512), embedding size (768), and speed.
## Using the model with the `transformers` library:
```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda() # uncomment it if you have a GPU
def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()
print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
```
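
The function above takes the hidden state of the first (`[CLS]`) token as the sentence embedding and scales it to unit length. A minimal pure-Python sketch of those two steps on made-up numbers follows; the tiny `hidden` batch and the helper name `cls_pool_normalize` are illustrative assumptions, not model output or part of any library.

```python
import math

def cls_pool_normalize(last_hidden_state):
    """Take the first-token vector of each sequence and L2-normalize it."""
    pooled = []
    for token_vectors in last_hidden_state:
        cls = token_vectors[0]                     # first token, like [:, 0, :]
        norm = math.sqrt(sum(x * x for x in cls))  # Euclidean length
        pooled.append([x / norm for x in cls])
    return pooled

# Toy batch: 1 sequence, 2 tokens, hidden size 3 (hypothetical values).
hidden = [[[3.0, 0.0, 4.0], [1.0, 1.0, 1.0]]]
unit = cls_pool_normalize(hidden)
print(unit)  # [[0.6, 0.0, 0.8]]
```

With the real model the same operation runs on tensors of shape `(batch, tokens, 768)`.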
## Usage with `sentence_transformers`:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
```
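
When embeddings have unit length, the dot product that `util.dot_score` computes coincides with cosine similarity, so the printed matrix can be read as pairwise similarity. The sketch below illustrates this with hand-made unit vectors standing in for real embeddings. The vectors and helper names are hypothetical, and whether the `sentence_transformers` pipeline for this model normalizes its output is an assumption worth checking on real data.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def most_similar(query_idx, vectors):
    """Index of the vector with the highest dot score to the query, excluding itself."""
    scores = [(dot(vectors[query_idx], v), i)
              for i, v in enumerate(vectors) if i != query_idx]
    return max(scores)[1]

# Hypothetical unit-length "embeddings" for three sentences.
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# Pairwise dot scores; for unit vectors these match cosine similarity.
dot_matrix = [[dot(a, b) for b in vectors] for a in vectors]
print(most_similar(0, vectors))  # 1
```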
## Metrics
Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
| Model | STS | PI | NLI | SA | TI |
|:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
| **sergeyzh/LaBSE-ru-sts** | 0.845 | 0.737 | 0.481 | 0.805 | 0.957 |
| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |

**Tasks:**
- Semantic text similarity (**STS**);
- Paraphrase identification (**PI**);
- Natural language inference (**NLI**);
- Sentiment analysis (**SA**);
- Toxicity identification (**TI**).
## Performance and sizes
Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
| Model | CPU | GPU | size | dim | n_ctx | n_vocab |
|:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
| **sergeyzh/LaBSE-ru-sts** | 42.835 | 8.561 | 490 | 768 | 512 | 55083 |
| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
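
To read the table in relative terms, one can divide the figures of a reference model by those of this model. The sketch below uses the CPU and GPU columns copied from the table above. It assumes these columns are processing times where lower is better; the table itself does not state units, so only ratios are computed. The dictionary layout and the helper name `ratio` are illustrative.

```python
# CPU / GPU columns copied from the table above.
timings = {
    "intfloat/multilingual-e5-large": {"CPU": 149.026, "GPU": 15.629},
    "sergeyzh/LaBSE-ru-sts": {"CPU": 42.835, "GPU": 8.561},
    "cointegrated/LaBSE-en-ru": {"CPU": 42.867, "GPU": 8.549},
}

def ratio(reference, model, column):
    """How many times larger the reference figure is than the model's figure."""
    return timings[reference][column] / timings[model][column]

cpu = ratio("intfloat/multilingual-e5-large", "sergeyzh/LaBSE-ru-sts", "CPU")
gpu = ratio("intfloat/multilingual-e5-large", "sergeyzh/LaBSE-ru-sts", "GPU")
print(f"{cpu:.2f} {gpu:.2f}")  # 3.48 1.83
```

The same helper shows that the figures for this model and its base model, `cointegrated/LaBSE-en-ru`, are nearly identical.
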

Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:
|Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|---------------------:|----------------------:|
|CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | **0.451** | 0.401 | 0.423 | 0.448 |
|GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | **0.497** |
|GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | **0.644** | 0.586 | 0.545 | 0.605 |
|HeadlineClassification | Accuracy | 0.772 | **0.793** | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
|InappropriatenessClassification | Accuracy | **0.646** | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
|KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | **0.566** |
|RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | **0.807** |
|RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
|RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | **0.741** |
|RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | **0.653** |
|RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
|RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | **0.582** |
|RuSciBenchGRNTIClusteringP2P | V-measure | **0.522** | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
|RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | **0.445** |
|RuSciBenchOECDClusteringP2P | V-measure | **0.473** | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
|SensitiveTopicsClassification | Accuracy | **0.285** | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
|TERRaClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 |

|Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|----------------------:|---------------------:|
|Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | **0.588** |
|Clustering | V-measure | 0.526 | 0.519 | 0.513 | **0.538** | 0.513 | 0.503 | 0.525 |
|MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | **0.361** | 0.314 | 0.329 | 0.353 |
|PairClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 |
|Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
|Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | **0.774** |
|STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
|Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | **0.630** |
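
The `Average` row appears to be the unweighted mean of the seven task-type rows above it. The sketch below checks this for three columns using values copied from the table; the variable names are illustrative.

```python
from statistics import mean

# Task-type scores in table order: Classification, Clustering,
# MultiLabelClassification, PairClassification, Reranking, Retrieval, STS.
by_task_type = {
    "LaBSE-ru-sts": [0.524, 0.513, 0.340, 0.587, 0.688, 0.637, 0.788],
    "LaBSE-ru-turbo": [0.558, 0.538, 0.361, 0.585, 0.687, 0.675, 0.822],
    "multilingual-e5-large": [0.588, 0.525, 0.353, 0.584, 0.756, 0.774, 0.831],
}

# Values from the "Average" row of the table.
reported_average = {
    "LaBSE-ru-sts": 0.582,
    "LaBSE-ru-turbo": 0.604,
    "multilingual-e5-large": 0.630,
}

# Recompute the mean per model and round to the table's three decimals.
averages = {name: round(mean(scores), 3) for name, scores in by_task_type.items()}
print(averages == reported_average)  # True
```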