	---
	language:
	- ru

	pipeline_tag: sentence-similarity

	tags:
	- russian
	- pretraining
	- embeddings
	- feature-extraction
	- sentence-similarity
	- sentence-transformers
	- transformers

	license: mit
	base_model: cointegrated/LaBSE-en-ru

	---

	## Base BERT for Semantic Text Similarity (STS) on GPU

	A high-quality BERT model for computing sentence embeddings for Russian text. It is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512 tokens), the same embedding dimension (768), and comparable speed.

	## Using the model with the `transformers` library:

	```python
	# pip install transformers sentencepiece
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
	model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
	# model.cuda()  # uncomment if you have a GPU


	def embed_bert_cls(text, model, tokenizer):
	    # Tokenize the input, truncating it to the model's maximum length
	    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
	    with torch.no_grad():
	        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
	    # Take the hidden state of the first ([CLS]) token as the sentence embedding
	    embeddings = model_output.last_hidden_state[:, 0, :]
	    # L2-normalize so that dot products behave as cosine similarities
	    embeddings = torch.nn.functional.normalize(embeddings)
	    return embeddings[0].cpu().numpy()


	print(embed_bert_cls('привет мир', model, tokenizer).shape)
	# (768,)
	```
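Because `embed_bert_cls` L2-normalizes its output, the dot product of two embeddings equals their cosine similarity. The sketch below illustrates that property with short plain-Python lists standing in for real 768-dimensional embeddings; the helper names (`l2_normalize`, `dot`, `cosine`) and the toy vectors are illustrative assumptions, not part of the model or of any library.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for sentence embeddings (made up for illustration)
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

a_n, b_n = l2_normalize(a), l2_normalize(b)

# After normalization, the plain dot product matches the cosine similarity
print(dot(a_n, b_n))
print(cosine(a, b))
```

With real embeddings the same reasoning applies: once vectors are normalized, a dot product is enough to compare sentences.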

	## Using the model with `sentence_transformers`:
	```python
	from sentence_transformers import SentenceTransformer, util

	model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')

	sentences = ["привет мир", "hello world", "здравствуй вселенная"]
	embeddings = model.encode(sentences)
	print(util.dot_score(embeddings, embeddings))
	```

	## Metrics
	Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:

	\| Model \| STS \| PI \| NLI \| SA \| TI \|
	\|:---------------------------------\|:---------:\|:---------:\|:---------:\|:---------:\|:---------:\|
	\| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) \| 0.862 \| 0.727 \| 0.473 \| 0.810 \| 0.979 \|
	\| sergeyzh/LaBSE-ru-sts \| 0.845 \| 0.737 \| 0.481 \| 0.805 \| 0.957 \|
	\| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) \| 0.815 \| 0.723 \| 0.477 \| 0.791 \| 0.949 \|
	\| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) \| 0.797 \| 0.702 \| 0.453 \| 0.778 \| 0.946 \|
	\| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) \| 0.793 \| 0.704 \| 0.457 \| 0.803 \| 0.970 \|
	\| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) \| 0.794 \| 0.659 \| 0.431 \| 0.761 \| 0.946 \|
	\| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) \| 0.750 \| 0.651 \| 0.417 \| 0.737 \| 0.937 \|

	Tasks:

	- Semantic text similarity (STS);
	- Paraphrase identification (PI);
	- Natural language inference (NLI);
	- Sentiment analysis (SA);
	- Toxicity identification (TI).

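	## Speed and size

	Model speed and size as measured by the [encodechka](https://github.com/avidale/encodechka) benchmark. CPU and GPU are average encoding times in milliseconds per sentence, size is the model size in MB, dim is the embedding dimension, n_ctx is the maximum context length in tokens, and n_vocab is the vocabulary size: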

	\| Model \| CPU \| GPU \| size \| dim \| n_ctx \| n_vocab \|
	\|:---------------------------------\|----------:\|----------:\|----------:\|----------:\|----------:\|----------:\|
	\| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) \| 149.026 \| 15.629 \| 2136 \| 1024 \| 514 \| 250002 \|
	\| sergeyzh/LaBSE-ru-sts \| 42.835 \| 8.561 \| 490 \| 768 \| 512 \| 55083 \|
	\| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) \| 6.417 \| 5.517 \| 123 \| 312 \| 2048 \| 83828 \|
	\| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) \| 3.208 \| 3.379 \| 111 \| 312 \| 2048 \| 83828 \|
	\| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) \| 43.314 \| 9.338 \| 532 \| 768 \| 512 \| 69382 \|
	\| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) \| 42.867 \| 8.549 \| 490 \| 768 \| 512 \| 55083 \|
	\| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) \| 3.212 \| 3.384 \| 111 \| 312 \| 2048 \| 83828 \|


	Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:

	\|Model Name \| Metric \| sbert_large_ mt_nlu_ru \| sbert_large_ nlu_ru \| LaBSE-ru-sts \| [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) \| multilingual-e5-small \| multilingual-e5-base \| multilingual-e5-large \|
	\|:----------------------------------\|:--------------------\|-----------------------:\|--------------------:\|----------------:\|------------------:\|----------------------:\|---------------------:\|----------------------:\|
	\|CEDRClassification \| Accuracy \| 0.368 \| 0.358 \| 0.418 \| 0.451 \| 0.401 \| 0.423 \| 0.448 \|
	\|GeoreviewClassification \| Accuracy \| 0.397 \| 0.400 \| 0.406 \| 0.438 \| 0.447 \| 0.461 \| 0.497 \|
	\|GeoreviewClusteringP2P \| V-measure \| 0.584 \| 0.590 \| 0.626 \| 0.644 \| 0.586 \| 0.545 \| 0.605 \|
	\|HeadlineClassification \| Accuracy \| 0.772 \| 0.793 \| 0.633 \| 0.688 \| 0.732 \| 0.757 \| 0.758 \|
	\|InappropriatenessClassification \| Accuracy \| 0.646 \| 0.625 \| 0.599 \| 0.615 \| 0.592 \| 0.588 \| 0.616 \|
	\|KinopoiskClassification \| Accuracy \| 0.503 \| 0.495 \| 0.496 \| 0.521 \| 0.500 \| 0.509 \| 0.566 \|
	\|RiaNewsRetrieval \| NDCG@10 \| 0.214 \| 0.111 \| 0.651 \| 0.694 \| 0.700 \| 0.702 \| 0.807 \|
	\|RuBQReranking \| MAP@10 \| 0.561 \| 0.468 \| 0.688 \| 0.687 \| 0.715 \| 0.720 \| 0.756 \|
	\|RuBQRetrieval \| NDCG@10 \| 0.298 \| 0.124 \| 0.622 \| 0.657 \| 0.685 \| 0.696 \| 0.741 \|
	\|RuReviewsClassification \| Accuracy \| 0.589 \| 0.583 \| 0.599 \| 0.632 \| 0.612 \| 0.630 \| 0.653 \|
	\|RuSTSBenchmarkSTS \| Pearson correlation \| 0.712 \| 0.588 \| 0.788 \| 0.822 \| 0.781 \| 0.796 \| 0.831 \|
	\|RuSciBenchGRNTIClassification \| Accuracy \| 0.542 \| 0.539 \| 0.529 \| 0.569 \| 0.550 \| 0.563 \| 0.582 \|
	\|RuSciBenchGRNTIClusteringP2P \| V-measure \| 0.522 \| 0.504 \| 0.486 \| 0.517 \| 0.511 \| 0.516 \| 0.520 \|
	\|RuSciBenchOECDClassification \| Accuracy \| 0.438 \| 0.430 \| 0.406 \| 0.440 \| 0.427 \| 0.423 \| 0.445 \|
	\|RuSciBenchOECDClusteringP2P \| V-measure \| 0.473 \| 0.464 \| 0.426 \| 0.452 \| 0.443 \| 0.448 \| 0.450 \|
	\|SensitiveTopicsClassification \| Accuracy \| 0.285 \| 0.280 \| 0.262 \| 0.272 \| 0.228 \| 0.234 \| 0.257 \|
	\|TERRaClassification \| Average Precision \| 0.520 \| 0.502 \| 0.587 \| 0.585 \| 0.551 \| 0.550 \| 0.584 \|

	\|Model Name \| Metric \| sbert_large_ mt_nlu_ru \| sbert_large_ nlu_ru \| LaBSE-ru-sts \| [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) \| multilingual-e5-small \| multilingual-e5-base \| multilingual-e5-large \|
	\|:----------------------------------\|:--------------------\|-----------------------:\|--------------------:\|----------------:\|------------------:\|----------------------:\|----------------------:\|---------------------:\|
	\|Classification \| Accuracy \| 0.554 \| 0.552 \| 0.524 \| 0.558 \| 0.551 \| 0.561 \| 0.588 \|
	\|Clustering \| V-measure \| 0.526 \| 0.519 \| 0.513 \| 0.538 \| 0.513 \| 0.503 \| 0.525 \|
	\|MultiLabelClassification \| Accuracy \| 0.326 \| 0.319 \| 0.340 \| 0.361 \| 0.314 \| 0.329 \| 0.353 \|
	\|PairClassification \| Average Precision \| 0.520 \| 0.502 \| 0.587 \| 0.585 \| 0.551 \| 0.550 \| 0.584 \|
	\|Reranking \| MAP@10 \| 0.561 \| 0.468 \| 0.688 \| 0.687 \| 0.715 \| 0.720 \| 0.756 \|
	\|Retrieval \| NDCG@10 \| 0.256 \| 0.118 \| 0.637 \| 0.675 \| 0.697 \| 0.699 \| 0.774 \|
	\|STS \| Pearson correlation \| 0.712 \| 0.588 \| 0.788 \| 0.822 \| 0.781 \| 0.796 \| 0.831 \|
	\|Average \| Average \| 0.494 \| 0.438 \| 0.582 \| 0.604 \| 0.588 \| 0.594 \| 0.630 \|
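The Average row appears to be the unweighted mean of the seven category rows in the summary table. A minimal sketch that checks this for the LaBSE-ru-sts column is shown below; the scores are copied from the table, while the names `category_scores` and `macro_average` are illustrative and not part of any benchmark tooling.

```python
# Category-level ruMTEB scores for LaBSE-ru-sts, copied from the summary table
category_scores = {
    "Classification": 0.524,
    "Clustering": 0.513,
    "MultiLabelClassification": 0.340,
    "PairClassification": 0.587,
    "Reranking": 0.688,
    "Retrieval": 0.637,
    "STS": 0.788,
}


def macro_average(scores):
    """Unweighted mean over task categories."""
    return sum(scores.values()) / len(scores)


average = macro_average(category_scores)
print(f"{average:.3f}")  # 0.582, matching the Average row for LaBSE-ru-sts
```

Since every category carries equal weight, a category with a single task (such as STS) influences the average as much as one that aggregates many tasks (such as Classification).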
