---
license: apache-2.0
language:
- en
base_model:
- google/flan-t5-large
pipeline_tag: text2text-generation
metrics:
- bertscore
---

# Targeted Paraphrasing Model for Adversarial Data Generation

This repository provides the **Un-Targeted Paraphrasing Model** developed as part of the research presented in the paper
**"Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion."**

The model is designed to generate high-quality paraphrases with enhanced fluency, diversity, and relevance, and is tailored for adversarial data generation.

---

## Table of Contents

1. [Paraphrasing Datasets](#paraphrasing-datasets)
2. [Model Description](#model-description)
3. [Applications](#applications)
   - [Installation](#installation)
   - [Usage](#usage)
4. [Citation](#citation)

---

## Paraphrasing Datasets

The training data was drawn from a curated collection of 560,550 paraphrase pairs from seven high-quality sources:
- **APT Dataset** (Nighojkar and Licato, 2021)
- **Microsoft Research Paraphrase Corpus (MSRP)** (Dolan and Brockett, 2005)
- **PARANMT-50M** (Wieting and Gimpel, 2018)
- **TwitterPPDB** (Lan et al., 2017)
- **PIT-2015** (Xu et al., 2015)
- **PARADE** (He et al., 2020)
- **Quora Question Pairs (QQP)** (Iyer et al., 2017)

Four filtering steps were applied to keep the data high-quality and diverse (a code sketch of these criteria follows below):
1. Removing pairs with more than 50% unigram overlap, to improve lexical diversity.
2. Removing pairs with less than 50% reordering of shared words, to improve syntactic diversity.
3. Removing pairs with less than 50% semantic similarity, measured by the cosine similarity of "all-MiniLM-L12-v2" sentence embeddings.
4. Removing pairs with more than 70% trigram overlap, to further enhance diversity.

The refined dataset consists of 96,073 samples, split into training (76,857), validation (9,608), and testing (9,608) subsets.
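
The thresholds above come from the filtering procedure described in the paper. The snippet below is only a minimal sketch of how such filters could be implemented with the `sentence-transformers` package; the `keep_pair`, `ngram_overlap`, and `reordering_ratio` helpers are hypothetical, and the word-reordering measure in particular is an illustrative approximation rather than the paper's exact definition.

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model named in filtering step 3.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

def ngram_overlap(a: str, b: str, n: int = 1) -> float:
    """Fraction of the n-grams of `a` that also occur in `b`."""
    grams = lambda s: {tuple(s.lower().split()[i:i + n]) for i in range(len(s.split()) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(len(ga), 1)

def reordering_ratio(a: str, b: str) -> float:
    """Illustrative proxy: fraction of shared words whose relative order differs between a and b."""
    b_words = b.lower().split()
    shared = [w for w in a.lower().split() if w in b_words]
    in_b_order = sorted(shared, key=b_words.index)
    moved = sum(1 for x, y in zip(shared, in_b_order) if x != y)
    return moved / max(len(shared), 1)

def keep_pair(src: str, tgt: str) -> bool:
    """Apply the four filtering criteria with the thresholds stated above."""
    if ngram_overlap(src, tgt, n=1) > 0.5:   # 1. too much unigram overlap
        return False
    if reordering_ratio(src, tgt) < 0.5:     # 2. too little reordering of shared words
        return False
    similarity = util.cos_sim(embedder.encode(src), embedder.encode(tgt)).item()
    if similarity < 0.5:                     # 3. too little semantic similarity
        return False
    if ngram_overlap(src, tgt, n=3) > 0.7:   # 4. too much trigram overlap
        return False
    return True
```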

---

## Model Description

The paraphrasing model is built on **FLAN-T5-large** and fine-tuned on the filtered dataset for nine epochs. Key features include:
- **Performance:** Achieves an F1 BERTScore of 75.925%, reflecting strong fluency and paraphrasing ability (an evaluation sketch follows this list).
- **Task-specificity:** Training focused on relevant pairs ensures high-quality, task-specific outputs.
- **Enhanced generation:** Generated paraphrases can introduce new information about entities or objects, improving overall generation quality.
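
A score of this kind can be computed with the `bert-score` package. The snippet below is a sketch under the assumption that the reported figure is a mean F1 over the held-out test split; the `candidates` and `references` lists are placeholders standing in for the model's outputs and the test-set references.

```python
from bert_score import score

# Placeholder lists: in practice, `candidates` are the model's paraphrases
# and `references` are the gold paraphrases from the test split.
candidates = ["How are you doing?", "What's the weather like today?"]
references = ["How are you?", "How is the weather today?"]

# P, R, F1 are per-sentence tensors; report the mean F1 over the full test set.
precision, recall, f1 = score(candidates, references, lang="en", verbose=False)
print(f"Mean F1 BERTScore: {f1.mean().item():.3%}")
```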

---

## Applications

This model is primarily designed to create adversarial training samples that uncover edge cases in machine learning models while keeping distribution distortion minimal.

Additionally, the model is suitable for **general paraphrasing purposes**, making it a versatile tool for generating high-quality paraphrases across contexts. It is also compatible with the **Parrot paraphrasing library**; usage examples with plain Transformers and with Parrot follow below.

### Installation
To install the Parrot library, run:
```bash
pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
```

### Usage

#### In Transformers
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "alykassem/FLAN-T5-Paraphraser"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the input; note the "Paraphrase: " prefix
input_text = "Paraphrase: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate and decode the paraphrase
outputs = model.generate(**inputs)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated text:", decoded_output)
```
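
By default, `generate` uses greedy decoding and returns a single sequence. Continuing from the snippet above, standard Hugging Face generation arguments can be passed to obtain several, more varied paraphrases; the specific values below are illustrative and not settings from the paper.

```python
# Illustrative decoding settings, reusing `model`, `tokenizer`, and `inputs` from above.
outputs = model.generate(
    **inputs,
    num_beams=5,                # beam search for higher-quality candidates
    num_return_sequences=3,     # return several distinct paraphrases
    max_new_tokens=64,
    early_stopping=True,
)
for i, candidate in enumerate(outputs):
    print(f"Paraphrase {i + 1}:", tokenizer.decode(candidate, skip_special_tokens=True))
```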

#### In Parrot
```python
from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

# Uncomment to get reproducible paraphrase generations
# def random_state(seed):
#     torch.manual_seed(seed)
#     if torch.cuda.is_available():
#         torch.cuda.manual_seed_all(seed)
#
# random_state(1234)

# Initialize Parrot with this paraphraser (the card states it is Parrot-compatible;
# Parrot's default model is "prithivida/parrot_paraphraser_on_T5").
# Ensure initialization happens only once in your code.
parrot = Parrot(model_tag="alykassem/FLAN-T5-Paraphraser", use_gpu=False)

phrases = [
    "Can you recommend some upscale restaurants in New York?",
    "What are the famous places we should not miss in Russia?",
]

for phrase in phrases:
    print("-" * 100)
    print("Input Phrase: ", phrase)
    print("-" * 100)
    # augment may return None when no paraphrase passes Parrot's internal filters
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases or []:
        print(para_phrase)
```

---

## Citation

If you find this work or model useful, please cite the paper:

```bibtex
@inproceedings{kassem-saad-2024-finding,
    title = "Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion",
    author = "Kassem, Aly and Saad, Sherif",
    editor = "Graham, Yvette and Purver, Matthew",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.33/",
    pages = "552--572",
}
```