---
license: apache-2.0
datasets:
- kimleang123/rfi_news
language:
- km
metrics:
- rouge
base_model:
- google/mt5-small
pipeline_tag: summarization
library_name: transformers
---
# Khmer mT5 Summarization Model (1024 Tokens)
## Introduction
This repository contains a fine-tuned mT5 model for Khmer text summarization. It extends the original [khmer-mt5-summarization](https://huggingface.co/songhieng/khmer-mt5-summarization) model: training was adjusted to accept inputs of up to 1024 tokens, so this version can summarize longer texts.
## Model Details
- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization with extended input length
- **Training Dataset:** `kimleang123/khmer-text-dataset`
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents) up to 1024 tokens
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score
## Installation & Setup
### 1️⃣ Install Dependencies
Ensure you have `transformers`, `torch`, and `datasets` installed:
```bash
pip install transformers torch datasets
```
### 2️⃣ Load the Model
To load and use the fine-tuned model:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "songhieng/khmer-mt5-summarization-1024tk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```
## How to Use
### 1️⃣ Using Python Code
```python
# Assumes `tokenizer` and `model` are loaded as shown in "Load the Model".
def summarize_khmer(text, max_length=150):
    # Prepend the "summarize: " task prefix and truncate the input to 1024 tokens.
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    # Beam search with 5 beams; max_length caps the length of the generated summary.
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
```
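With `truncation=True, max_length=1024`, the tokenizer drops everything past the token limit, so a longer article is summarized from its opening only. One possible workaround, sketched here under stated assumptions, is to split the text into sentences at the Khmer full stop (`។`) and pack them into chunks that each fit the budget. The helper names `split_sentences` and `chunk_by_token_budget`, the 1000-token default (chosen to leave headroom for the `summarize: ` prefix), and the word-count stand-in for a token counter are all illustrative; none of them are part of this model or of `transformers`.

```python
from typing import Callable, List

def split_sentences(text: str) -> List[str]:
    """Split Khmer text on the full stop '។' and re-attach it to each sentence."""
    return [part.strip() + "។" for part in text.split("។") if part.strip()]

def chunk_by_token_budget(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 1000,  # assumption: headroom below the 1024-token limit
) -> List[str]:
    """Greedily pack sentences into chunks whose summed token count fits the budget.

    Summing per-sentence counts only approximates the token count of the joined
    chunk. A single sentence larger than the budget still becomes its own chunk
    and would be truncated by the tokenizer.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Demo with a stand-in counter (whitespace-separated words), not the real tokenizer.
demo = split_sentences("a b c។ d e។ f g h i។")
print(chunk_by_token_budget(demo, lambda s: len(s.split()), max_tokens=5))
```

In real use, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])`, and each chunk could be passed to `summarize_khmer`. How well concatenated chunk summaries read has not been evaluated here.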
### 2️⃣ Using Hugging Face Pipeline
For a simpler approach:
```python
from transformers import pipeline
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
```
### 3️⃣ Deploy as an API using FastAPI
You can create a simple API for summarization:
```python
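from fastapi import Body, FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization-1024tk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
app = FastAPI()

@app.post("/summarize/")
def summarize(text: str = Body(..., embed=True)):
    # Read the text from the JSON body ({"text": "..."}) rather than the query string.
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload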
```
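Both the Python helper and the API endpoint call `generate` with `num_beams=5` and `length_penalty=2.0`. As documented for `transformers`, beam search divides a hypothesis's summed log-probability by its length raised to `length_penalty`; because log-probabilities are negative, a larger exponent favors longer outputs. The sketch below works through that arithmetic. The `beam_score` function and the two example beams are made up for illustration and are not library code.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: summed log-probs divided by length**penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams (numbers invented for illustration).
short_beam = {"sum_logprobs": -6.0, "length": 10}
long_beam = {"sum_logprobs": -9.0, "length": 20}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short_beam["sum_logprobs"], short_beam["length"], penalty)
    l = beam_score(long_beam["sum_logprobs"], long_beam["length"], penalty)
    winner = "long" if l > s else "short"
    print(f"length_penalty={penalty}: short={s:.4f} long={l:.4f} -> {winner} beam wins")
```

With no normalization (`0.0`) the shorter beam scores higher; at `1.0` and `2.0` the longer beam overtakes it.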
## Model Evaluation
The model was evaluated using **ROUGE scores**, which measure the similarity between the generated summaries and the reference summaries.
```python
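import numpy as np
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    # Requires predict_with_generate=True so that predictions are token IDs.
    pred_ids = pred.predictions
    # -100 marks ignored label positions; swap in the pad token before decoding.
    labels_ids = np.where(pred.label_ids != -100, pred.label_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Seq2SeqTrainer from fine-tuning, created with compute_metrics=compute_metrics.
trainer.evaluate()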
```
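To make the metric concrete, here is a minimal sketch of ROUGE-1 F1, the unigram-overlap variant of ROUGE. It is a simplified illustration, not the implementation used to evaluate this model. Khmer is written without spaces between words, so the choice of tokenizer matters; the example uses single characters as crude stand-in tokens, which is an assumption made only to keep the sketch dependency-free. The default tokenizer behind the `rouge` metric also appears to keep only ASCII letters and digits, so Khmer output may need a custom tokenizer; verify this against your installed version.

```python
from collections import Counter
from typing import Sequence

def rouge1_f1(prediction_tokens: Sequence[str], reference_tokens: Sequence[str]) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_counts = Counter(prediction_tokens)
    ref_counts = Counter(reference_tokens)
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Character-level tokens as a rough stand-in for a real Khmer word segmenter.
prediction = list("កម្ពុជាមានប្រជាជន")
reference = list("កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់")
print("ROUGE-1 F1:", round(rouge1_f1(prediction, reference), 3))
```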
## Saving & Uploading the Model
After fine-tuning, the model can be uploaded to the Hugging Face Hub:
```python
model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
```
To download it later:
```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
```
## Summary
| **Feature** | **Details** |
|-----------------------|-------------------------------------------------|
| **Base Model** | `google/mt5-small` |
| **Task** | Summarization |
| **Language** | Khmer (ខ្មែរ) |
| **Dataset** | `kimleang123/khmer-text-dataset` |
| **Framework** | Hugging Face Transformers |
| **Evaluation Metric** | ROUGE Score |
| **Deployment** | Hugging Face Model Hub, API (FastAPI), Python Code |
## Contributing
Contributions are welcome! Feel free to **open issues or submit pull requests** if you have any improvements or suggestions.
### Contact
If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.
**Built for the Khmer NLP Community**