alykassem/FLAN-T5-Paraphraser

Text2Text Generation

PyTorch

English

alykassem committed 22 days ago

Commit 55be723 (verified) · 1 parent: e84dea6

Update README.md

README.md +153 -3

@@ -1,3 +1,153 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+base_model:
+- google/flan-t5-large
+pipeline_tag: text2text-generation
+metrics:
+- bertscore
+---
+# Targeted Paraphrasing Model for Adversarial Data Generation
+This repository provides the **UN-Targeted Paraphrasing Model**, developed as part of the research presented in the paper:
+**"Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion."**
+The model is designed to generate high-quality paraphrases with enhanced fluency, diversity, and relevance, and is tailored for applications in adversarial data generation.
+
+---
+## Table of Contents
+1. [Paraphrasing Datasets](#paraphrasing-datasets)
+2. [Model Description](#model-description)
+3. [Applications](#applications)
+    - [Installation](#installation)
+    - [Usage](#usage)
+4. [Citation](#citation)
+---
+## Paraphrasing Datasets
+The training process utilized a meticulously curated dataset comprising 560,550 paraphrase pairs from seven high-quality sources:
+- **APT Dataset** (Nighojkar and Licato, 2021)
+- **Microsoft Research Paraphrase Corpus (MSRP)** (Dolan and Brockett, 2005)
+- **PARANMT-50M** (Wieting and Gimpel, 2018)
+- **TwitterPPDB** (Lan et al., 2017)
+- **PIT-2015** (Xu et al., 2015)
+- **PARADE** (He et al., 2020)
+- **Quora Question Pairs (QQP)** (Iyer et al., 2017)
+
+Filtering steps were applied to ensure high-quality and diverse data:
+1. Removal of pairs with over 50% unigram overlap to improve lexical diversity.
+2. Elimination of pairs with less than 50% reordering of shared words for syntactic diversity.
+3. Filtering out pairs with less than 50% semantic similarity, leveraging cosine similarity scores from the "all-MiniLM-L12-v2" model.
+4. Discarding pairs with over 70% trigram overlap to enhance diversity.
+
+The refined dataset consists of 96,073 samples, split into training (76,857), validation (9,608), and testing (9,608) subsets.
+
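
The overlap-based filters above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the README does not define how "overlap" is measured, so the `overlap` function below (share of the paraphrase's n-grams that also occur in the source) is an assumption, as are the helper names `ngrams`, `overlap`, and `keep_pair`. The reordering filter (step 2) is omitted because its metric is not specified, and the semantic-similarity filter (step 3) is represented by an optional `similarity` callable that you would back with an embedding model such as "all-MiniLM-L12-v2".

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(source, paraphrase, n):
    """Share of the paraphrase's n-grams that also occur in the source.

    Assumption: the README does not define "overlap"; this is one plausible choice.
    """
    candidate = ngrams(paraphrase, n)
    if not candidate:
        return 0.0
    return len(candidate & ngrams(source, n)) / len(candidate)


def keep_pair(source, paraphrase, similarity=None,
              max_unigram=0.5, max_trigram=0.7, min_similarity=0.5):
    """Apply filters 1, 3, and 4; filter 2 (reordering) is not implemented."""
    if overlap(source, paraphrase, 1) > max_unigram:
        return False  # step 1: too much lexical overlap
    if overlap(source, paraphrase, 3) > max_trigram:
        return False  # step 4: too much trigram overlap
    if similarity is not None and similarity(source, paraphrase) < min_similarity:
        return False  # step 3: meaning drifted too far
    return True


pairs = [
    ("How do I learn Python quickly?", "What is the fastest way to pick up Python?"),
    ("The cat sat on the mat.", "The cat sat on the mat today."),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept), "of", len(pairs), "pairs kept")
```

Passing the thresholds as keyword arguments keeps the 50% and 70% cut-offs from the list above visible and easy to adjust.
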
+---
+## Model Description
+The paraphrasing model is built upon **FLAN-T5-large** and fine-tuned on the filtered dataset for nine epochs. Key features include:
+- **Performance:** Achieves a BERTScore F1 of 75.925%, reflecting superior fluency and paraphrasing ability.
+- **Task-Specificity:** Focused training on relevant pairs ensures high-quality task-specific outputs.
+- **Enhanced Generation:** Generates paraphrases introducing new information about entities or objects, improving overall generation quality.
+---
+## Applications
+This model is primarily designed to create adversarial training samples that effectively uncover edge cases in machine learning models while maintaining minimal distribution distortion.
+Additionally, the model is suitable for **general paraphrasing purposes**, making it a versatile tool for generating high-quality paraphrases across various contexts. It is compatible with the **Parrot paraphrasing library** for seamless integration and usage. The sections below cover installing Parrot and using the model through either Transformers or Parrot.
+### Installation
+To install the Parrot library, run:
+```bash
+pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git
+```
+### Usage
+#### In Transformers
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+model_name = "alykassem/FLAN-T5-Paraphraser"
+# Load the tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+# Load the model
+model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+# Example usage: Tokenize input and generate output
+input_text = "Paraphrase: How are you?"
+inputs = tokenizer(input_text, return_tensors="pt")
+# Generate a paraphrase; max_new_tokens raises the short default length limit
+outputs = model.generate(**inputs, max_new_tokens=64)
+decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print("Generated text:", decoded_output)
+```
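
For adversarial data generation it is often useful to get several candidate paraphrases per input instead of one. The sketch below wraps the example above in a helper that uses beam search through the standard `generate` arguments `num_beams`, `num_return_sequences`, and `max_new_tokens`. The helper names `build_prompt`, `paraphrase`, and `demo` are hypothetical, and the beam settings are illustrative defaults, not values reported for this model.

```python
def build_prompt(sentence, prefix="Paraphrase: "):
    """Prepend the instruction prefix used in the example above."""
    return prefix + sentence.strip()


def paraphrase(sentence, model, tokenizer, num_candidates=3, max_new_tokens=64):
    """Return `num_candidates` beam-search paraphrases of `sentence`."""
    inputs = tokenizer(build_prompt(sentence), return_tensors="pt")
    outputs = model.generate(
        **inputs,
        # Beam search needs at least as many beams as returned sequences.
        num_beams=max(num_candidates, 4),
        num_return_sequences=num_candidates,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def demo():
    """Download the checkpoint and print a few paraphrases (needs network access)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "alykassem/FLAN-T5-Paraphraser"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    for candidate in paraphrase("How are you?", model, tokenizer):
        print(candidate)
```

Call `demo()` to run it end to end. Keeping the model and tokenizer as arguments to `paraphrase` means they are loaded once and reused across inputs.
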
+#### In Parrot
+```python
+from parrot import Parrot
+import torch
+import warnings
+warnings.filterwarnings("ignore")
+# Uncomment to get reproducible paraphrase generations
+# def random_state(seed):
+#     torch.manual_seed(seed)
+#     if torch.cuda.is_available():
+#         torch.cuda.manual_seed_all(seed)
+# random_state(1234)
+# Initialize Parrot once in your code; model_tag points at this repository's checkpoint
+parrot = Parrot(model_tag="alykassem/FLAN-T5-Paraphraser", use_gpu=False)
+phrases = [
+    "Can you recommend some upscale restaurants in New York?",
+    "What are the famous places we should not miss in Russia?"
+]
+for phrase in phrases:
+    print("-" * 100)
+    print("Input Phrase: ", phrase)
+    print("-" * 100)
+    # augment() may return None when no candidate passes Parrot's filters
+    para_phrases = parrot.augment(input_phrase=phrase) or []
+    for para_phrase in para_phrases:
+        print(para_phrase)
+```
+---
+## Citation
+If you find this work or model useful, please cite the paper:
+```bibtex
+@inproceedings{kassem-saad-2024-finding,
+    title = "Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion",
+    author = "Kassem, Aly  and
+      Saad, Sherif",
+    editor = "Graham, Yvette  and
+      Purver, Matthew",
+    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
+    month = mar,
+    year = "2024",
+    address = "St. Julian{'}s, Malta",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2024.eacl-long.33/",
+    pages = "552--572",
+}
+```