|
# BART-Based Text Summarization Model for News Aggregation |
|
|
|
This repository hosts a BART transformer model fine-tuned for abstractive text summarization of news articles. It is designed to condense lengthy news reports into concise, informative summaries, enhancing user experience for news readers and aggregators. |
|
|
|
## Model Details |
|
|
|
- **Model Architecture:** BART (Facebook's BART-base) |
|
- **Task:** Abstractive Text Summarization |
|
- **Domain:** News Articles |
|
- **Dataset:** Reddit-TIFU (Hugging Face Datasets) |
|
- **Fine-tuning Framework:** Hugging Face Transformers |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install datasets transformers rouge-score evaluate |
|
``` |
|
|
|
### Loading the Model |
|
|
|
```python |
|
from transformers import BartTokenizer, BartForConditionalGeneration
|
import torch |
|
|
|
# Load tokenizer and model |
|
device = 'cuda' if torch.cuda.is_available() else 'cpu' |
|
model_name = "facebook/bart-base" |
|
tokenizer = BartTokenizer.from_pretrained(model_name) |
|
model = BartForConditionalGeneration.from_pretrained(model_name).to(device) |
|
``` |
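
A minimal generation sketch is shown below. The article text and generation settings (`max_length`, `num_beams`) are illustrative choices, not values prescribed by this repository.

```python
# Summarize a single article (illustrative input and generation settings)
article = (
    "The city council approved a new transit plan on Tuesday, allocating funds "
    "for expanded bus routes and bike lanes over the next five years."
)

inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)

summary_ids = model.generate(
    inputs["input_ids"],
    max_length=64,        # cap on summary length
    num_beams=4,          # beam search for more fluent summaries
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```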
|
|
|
## Performance Metrics |
|
|
|
- **ROUGE-1:** 25.50

- **ROUGE-2:** 7.86

- **ROUGE-L:** 20.64

- **ROUGE-Lsum:** 21.18
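
These scores can be computed with the `evaluate` package from the installation step. The sketch below assumes `predictions` and `references` are lists of decoded summary strings; the placeholder strings are only for illustration.

```python
import evaluate

# ROUGE compares generated summaries against reference summaries
rouge = evaluate.load("rouge")

predictions = ["the council approved a new five-year transit plan"]   # model outputs (placeholder)
references = ["city council passes five-year transit plan"]           # gold summaries (placeholder)

results = rouge.compute(predictions=predictions, references=references)
print({k: round(v * 100, 2) for k, v in results.items()})  # rouge1, rouge2, rougeL, rougeLsum
```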
|
|
|
|
|
## Fine-Tuning Details |
|
|
|
### Dataset |
|
|
|
The dataset is sourced from Hugging Face's Reddit-TIFU dataset, which contains roughly 79,000 Reddit posts paired with their summaries.

The original training and testing sets were merged, shuffled, and re-split using a 90/10 ratio.
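
A sketch of this preprocessing is given below, assuming the `long` configuration of `reddit_tifu` with its `documents` (post body) and `tldr` (summary) columns; adjust the configuration and column names if your copy of the dataset differs.

```python
from datasets import load_dataset, concatenate_datasets

# Load Reddit-TIFU from the Hugging Face Hub
raw = load_dataset("reddit_tifu", "long")

# Merge all available splits, shuffle, and re-split 90/10
merged = concatenate_datasets([raw[split] for split in raw.keys()])
splits = merged.shuffle(seed=42).train_test_split(test_size=0.1)

train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))
```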
|
|
|
### Training Configuration |
|
|
|
- **Epochs:** 3 |
|
- **Batch Size:** 8 |
|
- **Learning Rate:** 2e-5 |
|
- **Evaluation Strategy:** epoch |
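
The sketch below wires these hyperparameters into the Hugging Face `Trainer`. The output directory and the tokenized datasets (`tokenized_train`, `tokenized_test`) are placeholders, and argument names can differ slightly across `transformers` versions.

```python
from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq

training_args = TrainingArguments(
    output_dir="bart-news-summarizer",   # placeholder output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",         # renamed to `eval_strategy` in newer releases
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,       # assumes datasets are already tokenized
    eval_dataset=tokenized_test,
    data_collator=data_collator,
)

trainer.train()
```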
|
|
|
### Quantization |
|
|
|
Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency. |
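
The repository does not state which quantization mode was applied; the sketch below shows one common choice, dynamic quantization of the model's linear layers, and should be read as illustrative rather than as the exact procedure used here.

```python
import torch

# Dynamically quantize Linear layers to int8 for smaller size and faster CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model.to("cpu"),        # dynamic quantization targets CPU inference
    {torch.nn.Linear},      # layer types to quantize
    dtype=torch.qint8,
)

# The quantized model is used the same way as the original
print(type(quantized_model))
```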
|
|
|
## Repository Structure |
|
|
|
``` |
|
.
├── config.json
├── tokenizer_config.json
├── special_tokens_map.json
├── tokenizer.json
├── model.safetensors        # Fine-tuned model weights
└── README.md                # Model documentation
|
``` |
|
|
|
## Limitations |
|
|
|
- The model may not generalize well to domains outside the fine-tuning dataset. |
|
|
|
- Quantization may result in minor accuracy degradation compared to full-precision models. |
|
|
|
## Contributing |
|
|
|
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements. |
|
|