Ogamon committed on
Commit
65d7673
·
verified ·
1 Parent(s): 45e6672

Update README.md

Files changed (1)
  1. README.md +69 -1
README.md CHANGED
@@ -1,4 +1,72 @@
  ---
  language:
  - en
- ---
  ---
  language:
  - en
+ ---
+
+ # Controllable Paraphrase Generation for Semantic and Lexical Similarities
+
+ **Con**trollable **P**araphrase **G**eneration for Semantic and Lexical **S**imilarities model (**ConPGS model**)
+
+ GitHub: https://github.com/Ogamon958/ConPGS
+
+ ## Paraphrase Corpora
+ https://drive.google.com/drive/folders/1V96SiVkgzlW9bn98K3S0q968vfoTOPy7?usp=sharing
+ (Paraphrase corpora constructed from wiki40b with the ConPGS model.)
+
+
+ ## How to use our model
+
+ ```
+ # Setup: load the model and tokenizer, then move the model to the GPU if one is available
+ import torch
+ from transformers import BartTokenizer, BartForConditionalGeneration
+ model = BartForConditionalGeneration.from_pretrained('Ogamon/conpgs_model')
+ tokenizer = BartTokenizer.from_pretrained('Ogamon/conpgs_model')
+ device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
+ model.to(device)
+
+ # Control tags for semantic similarity (SIM) and lexical similarity (BLEU)
+ sim_token = {70: "<SIM70>", 75: "<SIM75>", 80: "<SIM80>", 85: "<SIM85>", 90: "<SIM90>", 95: "<SIM95>"}
+ bleu_token = {5: "<BLEU0_5>", 10: "<BLEU10>", 15: "<BLEU15>", 20: "<BLEU20>", 25: "<BLEU25>", 30: "<BLEU30>", 35: "<BLEU35>", 40: "<BLEU40>"}
+ ```
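The tag lookup above can be wrapped in a small validation step so that an unsupported level fails early instead of raising a bare `KeyError`. The following is a minimal sketch: `build_tagged_input` is a hypothetical helper name introduced here for illustration and is not part of the ConPGS repository; the tag strings are copied from the setup block.

```python
# Tag tables copied from the README's setup block.
SIM_TOKEN = {70: "<SIM70>", 75: "<SIM75>", 80: "<SIM80>",
             85: "<SIM85>", 90: "<SIM90>", 95: "<SIM95>"}
BLEU_TOKEN = {5: "<BLEU0_5>", 10: "<BLEU10>", 15: "<BLEU15>", 20: "<BLEU20>",
              25: "<BLEU25>", 30: "<BLEU30>", 35: "<BLEU35>", 40: "<BLEU40>"}


def build_tagged_input(text, sim, bleu):
    """Hypothetical helper: prepend the SIM and BLEU control tags to `text`."""
    if sim not in SIM_TOKEN:
        raise ValueError(f"sim must be one of {sorted(SIM_TOKEN)}, got {sim}")
    if bleu not in BLEU_TOKEN:
        raise ValueError(f"bleu must be one of {sorted(BLEU_TOKEN)}, got {bleu}")
    # Same layout as the README: "<SIM..> <BLEU..> sentence"
    return f"{SIM_TOKEN[sim]} {BLEU_TOKEN[bleu]} {text}"


print(build_tagged_input("The cats must be handed over.", sim=95, bleu=5))
# <SIM95> <BLEU0_5> The cats must be handed over.
```

The returned string can be passed to the tokenizer in place of a hand-built f-string.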
+
+ ```
+ # Edit here
+ text = "The tiger sanctuary has been told their 147 cats must be handed over."
+ sim = sim_token[95]   # one of 70, 75, 80, 85, 90, 95
+ bleu = bleu_token[5]  # one of 5, 10, 15, 20, 25, 30, 35, 40
+
+ # Generate a paraphrase
+ model.eval()
+ with torch.no_grad():
+     # Prepend the control tags to the input sentence
+     input_text = f"{sim} {bleu} {text}"
+     inputs = tokenizer.encode(input_text, return_tensors="pt", truncation=True).to(device)
+     # Bound the output length relative to the input length (in tokens)
+     length = inputs.size()[1]
+     max_len = int(length * 1.5)
+     min_len = int(length * 0.75)
+
+     paraphrase_ids = model.generate(inputs, max_length=max_len, min_length=min_len, num_beams=5)
+     paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
+     print(paraphrase)
+     # The tiger sanctuary was told to hand over its 147 cats.
+ ```
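The length limits in the block above are plain arithmetic on the input token count: the output is allowed between 0.75 and 1.5 times the input length, truncated to integers. The sketch below isolates that step; `length_bounds` is a hypothetical helper name, and the default ratios are the ones used in the README. The token count here is whatever `tokenizer.encode` returns, so it includes the prepended tags and any special tokens the tokenizer adds.

```python
def length_bounds(n_input_tokens, min_ratio=0.75, max_ratio=1.5):
    """Hypothetical helper: return (min_length, max_length) for generation."""
    if n_input_tokens <= 0:
        raise ValueError("n_input_tokens must be positive")
    # int() truncates toward zero, matching int(length*0.75) and int(length*1.5)
    return int(n_input_tokens * min_ratio), int(n_input_tokens * max_ratio)


min_len, max_len = length_bounds(20)
print(min_len, max_len)
# 15 30
```

The two values correspond to the `min_length` and `max_length` arguments passed to `model.generate` in the README example.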
+
+
+ ## Citation
+ Please cite our LREC-COLING 2024 paper if you use this repository.
+ (Details will be updated after submission.)
+
+ ```
+ @inproceedings{ogasa-2024-lrec-coling,
+     title = {{Controllable Paraphrase Generation for Semantic and Lexical Similarities}},
+     author = "Ogasa, Yuya and Kajiwara, Tomoyuki and Arase, Yuki",
+     booktitle = "The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",
+     month = may,
+     year = "2024",
+     address = "Torino, Italia",
+     publisher = "Association for Computational Linguistics",
+     abstract = "We developed a controllable paraphrase generation model for semantic and lexical similarities using a simple and intuitive mechanism: attaching tags to specify these values at the head of the input sentence. Lexically diverse paraphrases have been long coveted for data augmentation. However, their generation is not straightforward because diversifying surfaces easily degrades semantic similarity. Furthermore, our experiments revealed two critical features in data augmentation by paraphrasing: appropriate similarities of paraphrases are highly downstream task-dependent, and mixing paraphrases of various similarities negatively affects the downstream tasks. These features indicated that the controllability in paraphrase generation is crucial for successful data augmentation. We tackled these challenges by fine-tuning a pre-trained sequence-to-sequence model employing tags that indicate the semantic and lexical similarities of synthetic paraphrases selected carefully based on the similarities. The resultant model could paraphrase an input sentence according to the tags specified. Extensive experiments on data augmentation for contrastive learning and pre-fine-tuning of pretrained masked language models confirmed the effectiveness of the proposed model. We release our paraphrase generation model and a corpus of 87 million diverse paraphrases. (https://github.com/Ogamon958/ConPGS)"
+ }
+ ```