---
license: apache-2.0
language:
- en
base_model:
- google/flan-t5-large
pipeline_tag: text2text-generation
metrics:
- bertscore
---

# Targeted Paraphrasing Model for Adversarial Data Generation

This repository provides the **(UN)-Targeted Paraphrasing Model**, developed as part of the research presented in the paper:  
**"Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion."**  

The model is designed to generate high-quality paraphrases with enhanced fluency, diversity, and relevance, and is tailored for adversarial data generation.

---

## Table of Contents

1. [Paraphrasing Datasets](#paraphrasing-datasets)
2. [Model Description](#model-description)
3. [Applications](#applications)
    - [Installation](#installation)
    - [Usage](#usage)
4. [Citation](#citation)

---

## Paraphrasing Datasets

The training data was curated from 560,550 paraphrase pairs collected from seven high-quality sources:
- **APT Dataset** (Nighojkar and Licato, 2021)  
- **Microsoft Research Paraphrase Corpus (MSRP)** (Dolan and Brockett, 2005)  
- **PARANMT-50M** (Wieting and Gimpel, 2018)  
- **TwitterPPDB** (Lan et al., 2017)  
- **PIT-2015** (Xu et al., 2015)  
- **PARADE** (He et al., 2020)  
- **Quora Question Pairs (QQP)** (Iyer et al., 2017)  

Filtering steps were applied to ensure high-quality and diverse data:
1. Removal of pairs with over 50% unigram overlap to improve lexical diversity.  
2. Elimination of pairs with less than 50% reordering of shared words for syntactic diversity.  
3. Filtering out pairs with less than 50% semantic similarity, leveraging cosine similarity scores from the "all-MiniLM-L12-v2" model.  
4. Discarding pairs with over 70% trigram overlap to enhance diversity.  

The refined dataset consists of 96,073 samples, split into training (76,857), validation (9,608), and testing (9,608) subsets.
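
A minimal sketch of how these four filters could be implemented is shown below. The helper functions, the exact token-level heuristics, and the use of the `sentence-transformers` library for the cosine-similarity check are illustrative assumptions, not the authors' original pipeline; only the thresholds mirror the description above.

```python
from sentence_transformers import SentenceTransformer, util

# Sentence encoder used for the semantic-similarity filter (step 3 above).
encoder = SentenceTransformer("all-MiniLM-L12-v2")

def ngram_overlap(a: str, b: str, n: int = 1) -> float:
    """Fraction of the n-grams of `a` that also occur in `b`."""
    a_toks, b_toks = a.lower().split(), b.lower().split()
    a_ngrams = {tuple(a_toks[i:i + n]) for i in range(len(a_toks) - n + 1)}
    b_ngrams = {tuple(b_toks[i:i + n]) for i in range(len(b_toks) - n + 1)}
    return len(a_ngrams & b_ngrams) / max(len(a_ngrams), 1)

def reordering_ratio(a: str, b: str) -> float:
    """Rough share of shared words whose relative order differs between a and b."""
    b_toks = b.lower().split()
    shared = [w for w in a.lower().split() if w in b_toks]
    b_order = [w for w in b_toks if w in shared]
    moved = sum(1 for x, y in zip(shared, b_order) if x != y)
    return moved / max(len(shared), 1)

def semantic_similarity(a: str, b: str) -> float:
    emb = encoder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def keep_pair(a: str, b: str) -> bool:
    return (
        ngram_overlap(a, b, n=1) <= 0.5       # 1. unigram overlap at most 50%
        and reordering_ratio(a, b) >= 0.5     # 2. at least 50% reordering of shared words
        and semantic_similarity(a, b) >= 0.5  # 3. semantic similarity at least 50%
        and ngram_overlap(a, b, n=3) <= 0.7   # 4. trigram overlap at most 70%
    )
```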

---

## Model Description

The paraphrasing model is built upon **FLAN-T5-large** and fine-tuned on the filtered dataset for nine epochs. Key features include:  
- **Performance:** Achieves an F1 BERT-Score of 75.925%, indicating strong fluency and paraphrase quality.  
- **Task-Specificity:** Focused training on relevant pairs ensures high-quality task-specific outputs.  
- **Enhanced Generation:** Generates paraphrases introducing new information about entities or objects, improving overall generation quality.  
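
As a point of reference, BERT-Score F1 can be computed with the `bert-score` package. The snippet below is a minimal sketch on placeholder sentences, not the evaluation script used in the paper.

```python
from bert_score import score

# Hypothetical example pair; in practice, candidates are the model's
# paraphrases and references are the held-out test targets.
candidates = ["How do I get to the airport from here?"]
references = ["What is the way to the airport from this place?"]

# Returns precision, recall, and F1 tensors with one value per pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```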

---

## Applications

This model is primarily designed to create adversarial training samples that effectively uncover edge cases in machine learning models while maintaining minimal distribution distortion.  

Additionally, the model is suitable for **general paraphrasing**, making it a versatile tool for generating high-quality paraphrases across contexts. It can be used directly with Hugging Face **Transformers** or through the **Parrot paraphrasing library**; examples of both are shown below.

### Installation
To install the Parrot library, run:
```bash
pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
```

### Usage
#### In Transformers
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "alykassem/FLAN-T5-Paraphraser"  

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example usage: Tokenize input and generate output
input_text = "Paraphrase: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated text:", decoded_output)

```
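
By default, `generate` produces a single, fairly short output. To obtain several diverse paraphrase candidates, beam-search parameters can be passed explicitly; the settings below are illustrative only and are not values prescribed by the paper.

```python
# Illustrative settings for producing multiple diverse paraphrases
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=5,
    num_beam_groups=5,       # diverse beam search
    diversity_penalty=0.5,
    num_return_sequences=5,
)
for i, seq in enumerate(outputs, start=1):
    print(f"Paraphrase {i}:", tokenizer.decode(seq, skip_special_tokens=True))
```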


#### In Parrot
```python
from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

# Uncomment to get reproducible paraphrase generations
# def random_state(seed):
#     torch.manual_seed(seed)
#     if torch.cuda.is_available():
#         torch.cuda.manual_seed_all(seed)

# random_state(1234)

# Initialize the Parrot model (ensure initialization occurs only once in your code)
parrot = Parrot(model_tag="alykassem/FLAN-T5-Paraphraser", use_gpu=True)

phrases = [
    "Can you recommend some upscale restaurants in New York?",
    "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
    print("-" * 100)
    print("Input Phrase: ", phrase)
    print("-" * 100)
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases:
        print(para_phrase)
```

---

## Citation

If you find this work or model useful, please cite the paper:

```
@inproceedings{kassem-saad-2024-finding,
    title = "Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion",
    author = "Kassem, Aly  and
      Saad, Sherif",
    editor = "Graham, Yvette  and
      Purver, Matthew",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.33/",
    pages = "552--572",
}
```