|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- google/flan-t5-large |
|
pipeline_tag: text2text-generation |
|
metrics: |
|
- bertscore |
|
--- |
|
|
|
# Targeted Paraphrasing Model for Adversarial Data Generation |
|
|
|
This repository provides the **(UN)-Targeted Paraphrasing Model**, developed as part of the research presented in the paper: |
|
**"Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion."** |
|
|
|
The model is designed to generate high-quality paraphrases with enhanced fluency, diversity, and relevance, and is tailored for applications in adversarial data generation. |
|
|
|
--- |
|
|
|
## Table of Contents |
|
|
|
1. [Paraphrasing Datasets](#paraphrasing-datasets) |
|
2. [Model Description](#model-description) |
|
3. [Applications](#applications) |
|
- [Installation](#installation) |
|
- [Usage](#usage) |
|
4. [Citation](#citation) |
|
|
|
--- |
|
|
|
## Paraphrasing Datasets |
|
|
|
The training process utilized a meticulously curated dataset comprising 560,550 paraphrase pairs from seven high-quality sources: |
|
- **APT Dataset** (Nighojkar and Licato, 2021) |
|
- **Microsoft Research Paraphrase Corpus (MSRP)** (Dolan and Brockett, 2005) |
|
- **PARANMT-50M** (Wieting and Gimpel, 2018) |
|
- **TwitterPPDB** (Lan et al., 2017) |
|
- **PIT-2015** (Xu et al., 2015) |
|
- **PARADE** (He et al., 2020) |
|
- **Quora Question Pairs (QQP)** (Iyer et al., 2017) |
|
|
|
Filtering steps were applied to ensure high-quality and diverse data: |
|
1. Removal of pairs with over 50% unigram overlap to improve lexical diversity. |
|
2. Elimination of pairs with less than 50% reordering of shared words for syntactic diversity. |
|
3. Filtering out pairs with less than 50% semantic similarity, leveraging cosine similarity scores from the "all-MiniLM-L12-v2" model. |
|
4. Discarding pairs with over 70% trigram overlap to enhance diversity. |
|
|
|
The refined dataset consists of 96,073 samples, split into training (76,857), validation (9,608), and testing (9,608) subsets. |
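The exact filtering scripts are not part of this repository; the snippet below is a minimal sketch of how the overlap and similarity criteria above could be applied with the `sentence-transformers` library. The overlap normalization and the omission of the word-reordering check are simplifying assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of the filtering criteria (not the original pipeline).
# Assumptions: n-gram overlap is normalized by the smaller n-gram set, and
# the word-reordering check is omitted because its exact definition is not
# reproduced here.
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

def ngram_overlap(a_tokens, b_tokens, n):
    a_ngrams = {tuple(a_tokens[i:i + n]) for i in range(len(a_tokens) - n + 1)}
    b_ngrams = {tuple(b_tokens[i:i + n]) for i in range(len(b_tokens) - n + 1)}
    if not a_ngrams or not b_ngrams:
        return 0.0
    return len(a_ngrams & b_ngrams) / min(len(a_ngrams), len(b_ngrams))

def keep_pair(source, target):
    src, tgt = source.lower().split(), target.lower().split()
    if ngram_overlap(src, tgt, n=1) > 0.5:   # 1. too much unigram overlap
        return False
    if ngram_overlap(src, tgt, n=3) > 0.7:   # 4. too much trigram overlap
        return False
    embeddings = similarity_model.encode([source, target], convert_to_tensor=True)
    if util.cos_sim(embeddings[0], embeddings[1]).item() < 0.5:  # 3. low semantic similarity
        return False
    return True

print(keep_pair("How can I learn to cook quickly?", "What is the fastest way to pick up cooking?"))
```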
|
|
|
--- |
|
|
|
## Model Description |
|
|
|
The paraphrasing model is built upon **FLAN-T5-large** and fine-tuned on the filtered dataset for nine epochs. Key features include: |
|
- **Performance:** Achieves an F1 BERTScore of 75.925%, reflecting strong fluency and paraphrasing ability.

- **Task-Specificity:** Focused training on relevant pairs ensures high-quality, task-specific outputs.

- **Enhanced Generation:** Generates paraphrases that introduce new information about entities or objects, improving overall generation quality.
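
For reference, the BERTScore metric reported above can be computed with the `evaluate` library; the snippet below is a minimal sketch over toy examples, not the paper's evaluation script, and the metric's internal model choices are left at their defaults.

```python
# Minimal BERTScore sketch with the `evaluate` library (illustrative only;
# the texts below are toy examples, not data from the paper).
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["How do you feel today?"]    # model-generated paraphrases
references = ["How are you doing today?"]   # reference paraphrases

results = bertscore.compute(predictions=predictions, references=references, lang="en")
print("Mean F1 BERTScore:", sum(results["f1"]) / len(results["f1"]))
```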
|
|
|
--- |
|
|
|
## Applications |
|
|
|
This model is primarily designed to create adversarial training samples that effectively uncover edge cases in machine learning models while maintaining minimal distribution distortion. |
|
|
|
Additionally, the model is suitable for **general paraphrasing purposes**, making it a versatile tool for generating high-quality paraphrases across various contexts. It is also compatible with the **Parrot paraphrasing library**. Below are examples of how to use the model both directly with the Transformers library and through Parrot:
|
|
|
### Installation |
|
To install the Parrot library, run: |
|
```bash |
|
pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git |
|
``` |
|
|
|
### Usage |
|
#### In Transformers
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
model_name = "alykassem/FLAN-T5-Paraphraser" |
|
|
|
# Load the tokenizer |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
# Load the model |
|
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) |
|
|
|
# Example usage: Tokenize input and generate output |
|
input_text = "Paraphrase: How are you?" |
|
inputs = tokenizer(input_text, return_tensors="pt") |
|
|
|
# Generate response |
|
outputs = model.generate(**inputs) |
|
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
|
print("Generated text:", decoded_output) |
|
|
|
``` |
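
By default, `generate` uses greedy decoding with a short output budget. Continuing the example above, standard generation arguments can be used to request several, more diverse paraphrases; the specific values below are illustrative, not settings from the paper:

```python
# Illustrative decoding settings (example values, not tuned recommendations).
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=5,
    num_return_sequences=3,
    early_stopping=True,
)
for i, output in enumerate(outputs):
    print(f"Paraphrase {i + 1}:", tokenizer.decode(output, skip_special_tokens=True))
```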
|
|
|
|
|
#### In Parrot
|
```python |
|
from parrot import Parrot |
|
import torch |
|
import warnings |
|
warnings.filterwarnings("ignore") |
|
|
|
# Uncomment to get reproducible paraphrase generations |
|
# def random_state(seed):
#     torch.manual_seed(seed)
#     if torch.cuda.is_available():
#         torch.cuda.manual_seed_all(seed)
#
# random_state(1234)
|
|
|
# Initialize the Parrot model (ensure initialization occurs only once in your code) |
|
parrot = Parrot(model_tag="alykassem/FLAN-T5-Paraphraser", use_gpu=True)  # set use_gpu=False on CPU-only machines
|
|
|
phrases = [ |
|
"Can you recommend some upscale restaurants in New York?", |
|
"What are the famous places we should not miss in Russia?" |
|
] |
|
|
|
for phrase in phrases:
    print("-" * 100)
    print("Input Phrase: ", phrase)
    print("-" * 100)
    para_phrases = parrot.augment(input_phrase=phrase)
    for para_phrase in para_phrases or []:  # augment() may return None if no paraphrase passes its filters
        print(para_phrase)
|
``` |
|
|
|
--- |
|
|
|
## Citation |
|
|
|
If you find this work or model useful, please cite the paper: |
|
|
|
``` |
|
@inproceedings{kassem-saad-2024-finding, |
|
title = "Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion", |
|
author = "Kassem, Aly and |
|
Saad, Sherif", |
|
editor = "Graham, Yvette and |
|
Purver, Matthew", |
|
booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)", |
|
month = mar, |
|
year = "2024", |
|
address = "St. Julian{'}s, Malta", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://aclanthology.org/2024.eacl-long.33/", |
|
pages = "552--572", |
|
} |
|
``` |