---
language:
- en
license:
- apache-2.0
- bsd-3-clause
tags:
- summarization
- extractive
- summary
- abstractive
- multi-task
- document summary
datasets:
- jordiclive/scored_summarization_datasets
metrics:
- rouge
---

# Multi-purpose Summarizer (Fine-tuned 3B google/flan-t5-xl on several Summarization datasets)

Open In Colab

A fine-tuned version of [google/flan-t5-xl](https://huggingface.co/google/flan-t5-xl) on several summarization datasets (xsum, wikihow, cnn_dailymail/3.0.0, samsum, scitldr/AIC, billsum, TLDR).

Goal: a model that can serve as a general-purpose summarizer for academic and everyday use. The type of summary can be controlled by varying the instruction prepended to the source document. The model works well on a wide range of text, although it was trained with a maximum source length of 512 tokens and a maximum summary length of 150 tokens.

- See the Colab demo linked above or try the [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text)

> Note: the hosted inference API is set to generate a maximum of 64 tokens for runtime reasons, so summaries may be truncated depending on the length of the input text. For best results, use the Python code below.

---

## Usage

Check the Colab notebook. **The model expects a prompt prepended to the source document to indicate the type of summary.** The prompts used to train the model are:

```
prompts = {
    "article": "Produce an article summary of the following news article:",
    "one_sentence": "Given the following news article, summarize the article in one sentence:",
    "conversation": "Briefly summarize in third person the following conversation:",
    "scitldr": "Given the following scientific article, provide a TL;DR summary:",
    "bill": "Summarize the following proposed legislation (bill):",
    "outlines": "Produce an article summary including outlines of each paragraph of the following article:",
}
```

After `pip install transformers`, run the following code:

```python
import torch
from transformers import pipeline

# Load the model in bfloat16 to reduce memory usage
summarizer = pipeline(
    "summarization",
    "jordiclive/flan-t5-3b-summarizer",
    torch_dtype=torch.bfloat16,
)

raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Produce an article summary of the following news article:"
results = summarizer(
    f"{prompt} {raw_document}",  # prepend the task prompt to the source document
    num_beams=5,
    min_length=5,
    no_repeat_ngram_size=3,
    truncation=True,
    max_length=512,
)
```

---

## Training procedure

- Training was done in BF16 with DeepSpeed stage 2 for 6 epochs, with ROUGE2 monitored on the validation set.
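For reference, a minimal sketch of what this training setup could look like with PyTorch Lightning and DeepSpeed; the `SummarizerModule` class and the `val_rouge2` metric name are hypothetical placeholders, not the actual training code:

```python
# Minimal sketch of the setup described above: BF16 precision, DeepSpeed ZeRO
# stage 2, checkpointing on validation ROUGE2. `SummarizerModule` and the
# metric name "val_rouge2" are hypothetical placeholders.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.strategies import DeepSpeedStrategy

# Keep the checkpoint that scores best on validation ROUGE2
checkpoint = ModelCheckpoint(monitor="val_rouge2", mode="max", save_top_k=1)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    precision="bf16",                     # BF16 training
    strategy=DeepSpeedStrategy(stage=2),  # DeepSpeed ZeRO stage 2
    accumulate_grad_batches=2,
    max_epochs=10,
    callbacks=[checkpoint],
)
# trainer.fit(SummarizerModule(), datamodule=...)  # LightningModule not shown
```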
## Hardware

- GPU count: 8 × NVIDIA A100-SXM4-40GB
- CPU count: 48

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 3e-05
- train_batch_size: 5
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 2
- effective_train_batch_size: 80 (5 per GPU × 8 GPUs × 2 accumulation steps)
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- warmup_steps: 2000
- num_epochs: 10

### Framework versions

- Transformers 4.24.0
- PyTorch 1.9.1+cu111
- DeepSpeed 0.7.4
- PyTorch Lightning 1.8.1
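Since ROUGE2 on the validation set was the model-selection metric, generated summaries can be sanity-checked with the `rouge` metric from the Hugging Face `evaluate` library. A minimal sketch, where the prediction/reference strings are placeholders:

```python
# Minimal sketch: score generated summaries with ROUGE, assuming `evaluate`
# and `rouge_score` are installed (pip install evaluate rouge_score).
import evaluate

rouge = evaluate.load("rouge")

# Placeholder strings; in practice, use summarizer(...) outputs as predictions
# and the reference summaries from your validation set.
predictions = ["The bill raises the minimum working age to 18 across the state."]
references = ["The proposed legislation sets 18 as the minimum working age statewide."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge2"])  # ROUGE-2 score, the metric monitored during training
```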