---
license: mit
language:
- en
base_model:
- google-t5/t5-base
datasets:
- abisee/cnn_dailymail
metrics:
- rouge
---

# T5-Base-Sum

This model is a fine-tuned version of `google-t5/t5-base` for summarization tasks. It was fine-tuned on 25,000 training samples from the CNN/DailyMail training set and is hosted on Hugging Face for easy access and use.

This model aims to deliver precise, factually consistent, and concise summaries, driven by a custom cyclic attention mechanism.
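
The fine-tuning data mentioned above can be pictured as pairs of prefixed articles and reference summaries. The sketch below is only an illustration under stated assumptions: it assumes the `summarize: ` prefix used at inference time was also used for training, and the helper name `build_example`, the `3.0.0` dataset config, and the choice of the first 25,000 rows are all hypothetical. This card does not say how the samples were selected or preprocessed.

```python
from typing import Dict

# Assumption: the "summarize: " task prefix used at inference time in this card
# was also used during fine-tuning. The card does not confirm this.
TASK_PREFIX = "summarize: "


def build_example(record: Dict[str, str]) -> Dict[str, str]:
    """Map one CNN/DailyMail record to a (source, target) text pair.

    CNN/DailyMail records carry the news text in "article" and the
    reference summary in "highlights".
    """
    return {
        "source": TASK_PREFIX + record["article"].strip(),
        "target": record["highlights"].strip(),
    }


# Hypothetical usage with the `datasets` library. It downloads data, so it is
# left commented out. The "3.0.0" config and the first-25,000-rows slice are
# assumptions, not the author's documented setup.
# from datasets import load_dataset
# train = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:25000]")
# pairs = train.map(build_example)

sample = {"article": " Some news text. ", "highlights": "A short summary.\n"}
print(build_example(sample))
```

Keeping the prefix in a single constant makes it harder for training and inference inputs to drift apart, which matters for T5 because the prefix tells the model which task to perform.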

## Model Usage

The example below loads the model, summarizes two sample articles, and then compares its output with three other summarization models:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the model and tokenizer from Hugging Face
model = T5ForConditionalGeneration.from_pretrained("Vijayendra/T5-Base-Sum")
tokenizer = T5Tokenizer.from_pretrained("Vijayendra/T5-Base-Sum")

# Example of using the model for summarization
article = """
Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company
said. The policy includes the termination of accounts of anti-vaccine influencers. Tech giants have been criticised for not doing more to
counter false health information on their sites. In July, US President Joe Biden said social media platforms were largely responsible for
people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue. YouTube, which is owned
by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation
about Covid vaccines. In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about
vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B. "We're expanding our medical
misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and
effective by local health authorities and the WHO," the post said, referring to the World Health Organization.
"""

# Prepend the T5 task prefix and truncate the input to 512 tokens
inputs = tokenizer.encode("summarize: " + article, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary with beam search
summary_ids = model.generate(inputs, max_length=150, min_length=100, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:")
print(summary)


# Example of a random article (can replace this with any article)
random_article = """
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans.
Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals.
Some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem-solving".
As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect.
A quip in Tesler's Theorem says "AI is whatever hasn't been done yet."
"""

# Tokenize the input article
inputs = tokenizer.encode("summarize: " + random_article, return_tensors="pt", max_length=512, truncation=True)

# Generate summary (wider beam and stronger length penalty than the first example)
summary_ids = model.generate(inputs, max_length=150, min_length=100, length_penalty=3.0, num_beams=7, early_stopping=False)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:")
print(summary)


# Compare with some other models
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    BartForConditionalGeneration,
    BartTokenizer,
    PegasusForConditionalGeneration,
    PegasusTokenizer,
    T5ForConditionalGeneration,
    T5Tokenizer,
)


# Function to summarize with any model.
# Only the T5 checkpoint expects the "summarize: " task prefix, so it is optional.
def summarize_article(article, model, tokenizer, prefix=""):
    inputs = tokenizer.encode(prefix + article, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(inputs, max_length=150, min_length=100, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary


# Load our fine-tuned T5 model and tokenizer
t5_model_custom = T5ForConditionalGeneration.from_pretrained("Vijayendra/T5-Base-Sum")
t5_tokenizer_custom = T5Tokenizer.from_pretrained("Vijayendra/T5-Base-Sum")

# Load a different pretrained model for summarization (mT5 fine-tuned on XL-Sum).
# This is an mT5 checkpoint, so it is loaded through the Auto classes.
t5_model_pretrained = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
t5_tokenizer_pretrained = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")

# Load Pegasus model and tokenizer
pegasus_model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
pegasus_tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

# Load BART model and tokenizer
bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

# Example article for summarization
article = """
Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company
said. The policy includes the termination of accounts of anti-vaccine influencers. Tech giants have been criticised for not doing more to
counter false health information on their sites. In July, US President Joe Biden said social media platforms were largely responsible for
people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue. YouTube, which is owned
by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation
about Covid vaccines. In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about
vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B. "We're expanding our medical
misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and
effective by local health authorities and the WHO," the post said, referring to the World Health Organization.
"""

# Summarize with our fine-tuned T5 model (uses the T5 task prefix)
t5_summary_custom = summarize_article(article, t5_model_custom, t5_tokenizer_custom, prefix="summarize: ")

# Summarize with the pretrained mT5 model for summarization
t5_summary_pretrained = summarize_article(article, t5_model_pretrained, t5_tokenizer_pretrained)

# Summarize with Pegasus model
pegasus_summary = summarize_article(article, pegasus_model, pegasus_tokenizer)

# Summarize with BART model
bart_summary = summarize_article(article, bart_model, bart_tokenizer)

# Print summaries for comparison
print("T5 base with Cyclic Attention Summary:")
print(t5_summary_custom)
print("\nPretrained mT5_multilingual_XLSum Summary:")
print(t5_summary_pretrained)
print("\nPegasus Xsum Summary:")
print(pegasus_summary)
print("\nBART Large CNN Summary:")
print(bart_summary)
```