|
--- |
|
language: |
|
- ru |
|
tags: |
|
- sentence-similarity |
|
- text-classification |
|
datasets: |
|
- merionum/ru_paraphraser |
|
- RuPAWS |
|
--- |
|
|
|
|
|
This is a cross-encoder model trained to predict semantic equivalence of two Russian sentences. |
|
|
|
It classifies text pairs as paraphrases (class 1) or non-paraphrases (class 0). Its scores can be used as a metric of content preservation for paraphrasing or text style transfer. |
|
|
|
It is a [sberbank-ai/ruRoberta-large](https://huggingface.co/sberbank-ai/ruRoberta-large) model fine-tuned on a union of 3 datasets: |
|
1. `RuPAWS`: https://github.com/ivkrotova/rupaws_dataset based on Quora and QQP; |
|
2. `ru_paraphraser`: https://huggingface.co/merionum/ru_paraphraser; |
|
3. Results of the manual check of content preservation for the [RUSSE-2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf) text detoxification dataset collection (`content_5.tsv`). |
|
|
|
The task was formulated as binary classification: whether the two sentences have the same meaning (1) or different (0). |
|
|
|
The table shows the training dataset size after duplication (joining `text1 + text2` and `text2 + text1` pairs): |
|
|
|
source \ label | 0 | 1 |
|
-- | -- | -- |
|
detox | 1412| 3843 |
|
paraphraser |5539 | 1688 |
|
rupaws_qqp |1112 | 792 |
|
rupaws_wiki |3526 | 2166 |
|
|
|
The model was trained with Adam optimizer and the following hyperparameters: |
|
|
|
``` |
|
learning_rate = 1e-5 |
|
batch_size = 8 |
|
gradient_accumulation_steps = 4 |
|
n_epochs = 3 |
|
max_grad_norm = 1.0 |
|
``` |
|
|
|
After training, the model had the following ROC AUC scores on the test sets: |
|
set | ROC AUC |
|
- | - |
|
detox | 0.857112 |
|
paraphraser | 0.858465 |
|
rupaws_qqp | 0.859195 |
|
rupaws_wiki | 0.906121 |
|
|
|
Example usage: |
|
|
|
```Python |
|
import torch |
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer |
|
|
|
model = AutoModelForSequenceClassification.from_pretrained('SkolkovoInstitute/ruRoberta-large-paraphrase-v1') |
|
tokenizer = AutoTokenizer.from_pretrained('SkolkovoInstitute/ruRoberta-large-paraphrase-v1') |
|
|
|
def get_similarity(text1, text2): |
|
""" Predict the probability that two Russian sentences are paraphrases of each other. """ |
|
with torch.inference_mode(): |
|
batch = tokenizer( |
|
text1, text2, |
|
truncation=True, max_length=model.config.max_position_embeddings, return_tensors='pt', |
|
).to(model.device) |
|
proba = torch.softmax(model(**batch).logits, -1) |
|
return proba[0][1].item() |
|
|
|
print(get_similarity('Я тебя люблю', 'Ты мне нравишься')) # 0.9798 |
|
print(get_similarity('Я тебя люблю', 'Я тебя ненавижу')) # 0.0008 |
|
``` |