---
language:
- en
---

# ConPGS Model
**Con**trollable **P**araphrase **G**eneration for Semantic and Lexical **S**imilarities Model

It was introduced in [the LREC-COLING 2024 paper: Controllable Paraphrase Generation for Semantic and Lexical Similarities](https://aclanthology.org/2024.lrec-main.348/).

GitHub: https://github.com/Ogamon958/ConPGS

## Model Description
The ConPGS model generates paraphrases while controlling their **semantic** and **lexical** similarity to the input sentence.  
**Semantic** similarity can be set to one of six levels: 70, 75, 80, 85, 90, or 95. The higher the level, the closer the meaning of the output is to the input. This value is referred to as **sim**.  
**Lexical** similarity can be set to one of eight levels: 5, 10, 15, 20, 25, 30, 35, or 40. The higher the level, the closer the surface form (wording) of the output is to the input. This value is referred to as **bleu**.

This model was constructed by fine-tuning [BART-large](https://huggingface.co/facebook/bart-large).
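
Both control values are passed to the model as tags placed at the head of the input sentence. The following is a minimal sketch of that mapping, assuming the tag spellings from the setup snippet in the "How to use" section (note that level 5 is spelled `<BLEU0_5>` there). `build_tagged_input` is a hypothetical helper written for illustration and is not part of the released code.

```python
# Tag spellings copied from the setup snippet in this model card.
SIM_TOKENS = {70: "<SIM70>", 75: "<SIM75>", 80: "<SIM80>",
              85: "<SIM85>", 90: "<SIM90>", 95: "<SIM95>"}
BLEU_TOKENS = {5: "<BLEU0_5>", 10: "<BLEU10>", 15: "<BLEU15>", 20: "<BLEU20>",
               25: "<BLEU25>", 30: "<BLEU30>", 35: "<BLEU35>", 40: "<BLEU40>"}


def build_tagged_input(text: str, sim: int, bleu: int) -> str:
    """Hypothetical helper: prepend the sim and bleu tags to a sentence."""
    # Reject levels the model has no tag for instead of failing with a KeyError.
    if sim not in SIM_TOKENS:
        raise ValueError(f"sim must be one of {sorted(SIM_TOKENS)}, got {sim}")
    if bleu not in BLEU_TOKENS:
        raise ValueError(f"bleu must be one of {sorted(BLEU_TOKENS)}, got {bleu}")
    return f"{SIM_TOKENS[sim]} {BLEU_TOKENS[bleu]} {text}"


print(build_tagged_input("The tiger sanctuary has been told their 147 cats must be handed over.", 95, 5))
# <SIM95> <BLEU0_5> The tiger sanctuary has been told their 147 cats must be handed over.
```

Validating the levels up front gives a readable error message when an unsupported value such as 72 is requested.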


## How to use
Here is how to use this model in PyTorch:

```python
# Setup: load the model and tokenizer, then move the model to the GPU if one is available
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('Ogamon/conpgs_model')
tokenizer = BartTokenizer.from_pretrained('Ogamon/conpgs_model')
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Control tags: map each sim / bleu level to its tag token
sim_token = {70: "<SIM70>", 75: "<SIM75>", 80: "<SIM80>", 85: "<SIM85>", 90: "<SIM90>", 95: "<SIM95>"}
bleu_token = {5: "<BLEU0_5>", 10: "<BLEU10>", 15: "<BLEU15>", 20: "<BLEU20>", 25: "<BLEU25>", 30: "<BLEU30>", 35: "<BLEU35>", 40: "<BLEU40>"}
```

```python
# Edit here: the input sentence and the desired similarity levels
text = "The tiger sanctuary has been told their 147 cats must be handed over."
sim = sim_token[95]   # one of 70, 75, 80, 85, 90, 95
bleu = bleu_token[5]  # one of 5, 10, 15, 20, 25, 30, 35, 40

# Generate a paraphrase
model.eval()
with torch.no_grad():
    # Prepend the control tags to the input sentence
    input_text = f"{sim} {bleu} {text}"
    inputs = tokenizer.encode(input_text, return_tensors="pt", truncation=True).to(device)

    # Bound the output length relative to the input length (in tokens)
    length = inputs.size()[1]
    max_len = int(length * 1.5)
    min_len = int(length * 0.75)

    output_ids = model.generate(inputs, max_length=max_len, min_length=min_len, num_beams=5)
    paraphrase = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print(paraphrase)
    # Example output: The tiger sanctuary was told to hand over its 147 cats.
```
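
The snippet above bounds the generated length to between 0.75 and 1.5 times the number of input tokens, truncating toward zero with `int`. As a small worked illustration of that arithmetic, the sketch below isolates it in a hypothetical helper named `length_bounds`. The token count is whatever the tokenizer returns, so it includes the control tags and any special tokens the tokenizer adds.

```python
from typing import Tuple


def length_bounds(num_input_tokens: int) -> Tuple[int, int]:
    """Hypothetical helper: (min_length, max_length) with the factors used above."""
    min_len = int(num_input_tokens * 0.75)
    max_len = int(num_input_tokens * 1.5)
    return min_len, max_len


print(length_bounds(20))  # (15, 30)
print(length_bounds(17))  # (12, 25)
```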


## Citation
Please cite [our LREC-COLING 2024 paper](https://aclanthology.org/2024.lrec-main.348/) if you use our model or paraphrase corpora:  

```
@inproceedings{ogasa-etal-2024-controllable-paraphrase,
    title = "Controllable Paraphrase Generation for Semantic and Lexical Similarities",
    author = "Ogasa, Yuya  and
      Kajiwara, Tomoyuki  and
      Arase, Yuki",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.348",
    pages = "3927--3942",
    abstract = "We developed a controllable paraphrase generation model for semantic and lexical similarities using a simple and intuitive mechanism: attaching tags to specify these values at the head of the input sentence. Lexically diverse paraphrases have been long coveted for data augmentation. However, their generation is not straightforward because diversifying surfaces easily degrades semantic similarity. Furthermore, our experiments revealed two critical features in data augmentation by paraphrasing: appropriate similarities of paraphrases are highly downstream task-dependent, and mixing paraphrases of various similarities negatively affects the downstream tasks. These features indicated that the controllability in paraphrase generation is crucial for successful data augmentation. We tackled these challenges by fine-tuning a pre-trained sequence-to-sequence model employing tags that indicate the semantic and lexical similarities of synthetic paraphrases selected carefully based on the similarities. The resultant model could paraphrase an input sentence according to the tags specified. Extensive experiments on data augmentation for contrastive learning and pre-fine-tuning of pretrained masked language models confirmed the effectiveness of the proposed model. We release our paraphrase generation model and a corpus of 87 million diverse paraphrases. (https://github.com/Ogamon958/ConPGS)",
}
```