---
license: apache-2.0
datasets:
- kimleang123/rfi_news
language:
- km
metrics:
- rouge
base_model:
- google/mt5-small
pipeline_tag: summarization
library_name: transformers
---

# Khmer mT5 Summarization Model (1024 Tokens)

## Introduction

This repository contains a fine-tuned mT5 model for Khmer text summarization. It extends the original [khmer-mt5-summarization](https://huggingface.co/songhieng/khmer-mt5-summarization) model. The main change in this version is support for longer texts: training was adjusted to accept inputs of up to 1024 tokens.

## Model Details

- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization with extended input length
- **Training Dataset:** `kimleang123/khmer-text-dataset`
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents) up to 1024 tokens
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score

## Installation & Setup

### 1️⃣ Install Dependencies

Ensure you have `transformers`, `torch`, and `datasets` installed. mT5 uses a SentencePiece-based tokenizer, so `sentencepiece` is included here as well in case the tokenizer needs it to load:

```bash
pip install transformers torch datasets sentencepiece
```

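To confirm that the packages are importable before loading the model, a quick check along these lines may help. This is only a sketch: the helper name `missing_packages` is made up for illustration, and it tests importability only, not versions.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Core packages this model card relies on.
    missing = missing_packages(["transformers", "torch", "datasets"])
    print("Missing packages:", missing or "none")
```
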
### 2️⃣ Load the Model

To load and use the fine-tuned model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization-1024tk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

## How to Use

### 1️⃣ Using Python Code

```python
# Uses the `tokenizer` and `model` loaded in "Load the Model" above.
def summarize_khmer(text, max_length=150):
    input_text = f"summarize: {text}"
    # Inputs longer than 1024 tokens are truncated.
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    # Beam search; max_length caps the summary length in tokens.
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# English: "Cambodia has a population of about 16 million and is a country in Southeast Asia."
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
```

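The tokenizer call in this example truncates at 1024 tokens, so anything past that limit is dropped before the model sees it. One possible workaround, sketched below, is to split a long document at the Khmer full stop (`។`), pack sentences into chunks that fit the token budget, and summarize each chunk separately. The helper names `split_into_chunks` and `summarize_long`, and the two callables they take, are hypothetical and not part of the model's API. The chunk-then-join strategy is an assumption, not how the model was trained or evaluated.

```python
def split_into_chunks(text, count_tokens, max_tokens=1024, delimiter="។"):
    """Greedily pack sentences into chunks of at most `max_tokens` tokens.

    `count_tokens` is any callable that returns the token count of a string.
    A single sentence longer than `max_tokens` is kept as its own chunk and
    will still be truncated by the tokenizer.
    """
    # Split on the Khmer full stop and put the delimiter back on each sentence.
    sentences = [s.strip() + delimiter for s in text.split(delimiter) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = current + sentence
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text, summarize, count_tokens, max_tokens=1024):
    """Summarize each chunk with `summarize` and join the partial summaries."""
    chunks = split_into_chunks(text, count_tokens, max_tokens)
    return " ".join(summarize(chunk) for chunk in chunks)


# Example wiring with the objects from this model card (not run here):
# count = lambda t: len(tokenizer(f"summarize: {t}")["input_ids"])
# summary = summarize_long(long_text, summarize_khmer, count)
```

Passing the token counter and the summarizer in as callables keeps the splitting logic independent of the model, so it can be tried out with simple stand-ins first.
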
### 2️⃣ Using Hugging Face Pipeline

For a simpler approach:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
```

### 3️⃣ Deploy as an API using FastAPI

You can create a simple API for summarization:

```python
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup so every request reuses it.
model_name = "songhieng/khmer-mt5-summarization-1024tk"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):
    # With this signature FastAPI reads `text` from the query string.
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
```

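A client for this endpoint could look like the sketch below, which uses only the standard library. Because the handler declares `text: str`, FastAPI expects `text` as a query parameter, so the client URL-encodes it into the query string. The helper names and the default address `http://127.0.0.1:8000` (uvicorn's default) are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_summarize_request(text, base_url="http://127.0.0.1:8000"):
    """Build a POST request for /summarize/ with `text` as a query parameter."""
    query = urlencode({"text": text})  # percent-encodes the Khmer characters
    return Request(f"{base_url}/summarize/?{query}", method="POST")


def request_summary(text, base_url="http://127.0.0.1:8000"):
    """Send the request and return the "summary" field of the JSON response."""
    with urlopen(build_summarize_request(text, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))["summary"]
```

With the server running, `request_summary(khmer_text)` would return the summary string. Very long articles may exceed URL length limits when sent as a query parameter.
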
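## Model Evaluation

The model was evaluated using **ROUGE scores**, which measure the n-gram overlap between the generated summaries and the reference summaries.

```python
# Requires: pip install evaluate rouge_score
# (`datasets.load_metric` has been removed from recent `datasets` releases.)
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # Padded label positions are stored as -100; restore the pad token before decoding.
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# Assumes `trainer` is the trainer from fine-tuning, created with
# compute_metrics=compute_metrics and configured so predictions are token IDs.
trainer.evaluate()
```
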
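To make the metric concrete, the sketch below computes a simplified ROUGE-1 F1 score from unigram overlap. It is an illustration of the idea, not the implementation used for evaluation here. The function name `rouge1_f1` is made up, and it takes pre-split token lists because Khmer is not whitespace-delimited, so the choice of tokenization is left to the caller.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 between two token lists (a simplified ROUGE-1)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)  # share of the summary that matches
    recall = overlap / len(reference_tokens)     # share of the reference covered
    return 2 * precision * recall / (precision + recall)
```

For a candidate `["a", "b", "c"]` and a reference `["a", "b", "d", "e"]`, two unigrams overlap, giving precision 2/3, recall 1/2, and an F1 of 4/7.
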
## Saving & Uploading the Model

After fine-tuning, the model can be uploaded to the Hugging Face Hub:

```python
model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
```

To download it later:

```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
```

## Summary

| **Feature**           | **Details**                                        |
|-----------------------|----------------------------------------------------|
| **Base Model**        | `google/mt5-small`                                 |
| **Task**              | Summarization                                      |
| **Language**          | Khmer (ខ្មែរ)                                      |
| **Dataset**           | `kimleang123/khmer-text-dataset`                   |
| **Framework**         | Hugging Face Transformers                          |
| **Evaluation Metric** | ROUGE Score                                        |
| **Deployment**        | Hugging Face Model Hub, API (FastAPI), Python Code |

## Contributing

Contributions are welcome! Feel free to **open issues or submit pull requests** if you have any improvements or suggestions.

### Contact

If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

**Built for the Khmer NLP Community**