# Khmer mT5 Summarization Model (1024 Tokens) - V2

## Introduction

This repository contains an improved version of the Khmer mT5 summarization model, **songhieng/khmer-mt5-summarization-1024tk-V2**. This version has been trained on an expanded dataset, including data from [kimleang123/rfi_news](https://huggingface.co/datasets/kimleang123/rfi_news), resulting in improved summarization performance on Khmer text.

## Model Details

- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization with extended input length
- **Training Dataset:** `kimleang123/rfi_news` + previous dataset
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents) up to 1024 tokens
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score

## Installation & Setup

### 1️⃣ Install Dependencies

Ensure you have `transformers`, `torch`, and `datasets` installed:

```bash
pip install transformers torch datasets
```

### 2️⃣ Load the Model

To load and use the fine-tuned model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

## How to Use

### 1️⃣ Using Python Code

```python
def summarize_khmer(text, max_length=150):
    # Prefix the input with the "summarize:" instruction used during fine-tuning
    input_text = f"summarize: {text}"
    # Truncate inputs to the model's 1024-token limit
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
```

### 2️⃣ Using Hugging Face Pipeline

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk-V2")

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
```

### 3️⃣ Deploy as an API using FastAPI

```python
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization-1024tk-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):
    # `text` is received as a query parameter, e.g. POST /summarize/?text=...
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
```

## Model Evaluation

The model was evaluated using **ROUGE scores**, which measure the overlap between the generated summaries and the reference summaries.
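For a quick, standalone check outside of the training loop, a single generated summary can be scored against a reference with the `evaluate` library. The sketch below is only an illustration: the reference summary is a hypothetical placeholder, and `summarize_khmer` is the helper defined in the usage example above.

```python
# Minimal sketch: score one generated summary against a reference summary
# using the `evaluate` library (pip install evaluate). The reference string
# here is a made-up placeholder, not taken from the training data.
import evaluate

rouge = evaluate.load("rouge")

document = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
reference_summary = "កម្ពុជាជាប្រទេសនៅអាស៊ីអាគ្នេយ៍ ដែលមានប្រជាជនប្រមាណ ១៦ លាននាក់។"  # placeholder

prediction = summarize_khmer(document)  # helper from the "Using Python Code" section
scores = rouge.compute(predictions=[prediction], references=[reference_summary])
print(scores)  # dict of ROUGE-1 / ROUGE-2 / ROUGE-L scores
```

The trainer-level evaluation used during fine-tuning is shown below.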
```python
# ROUGE via the Hugging Face `evaluate` library (pip install evaluate);
# `datasets.load_metric` has been removed in recent versions of `datasets`.
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Seq2SeqTrainer used for fine-tuning, configured with
# compute_metrics=compute_metrics and predict_with_generate=True.
trainer.evaluate()
```

## Saving & Uploading the Model

After fine-tuning, the model can be uploaded to the Hugging Face Hub:

```python
model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk-V2")
```

To download it later:

```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk-V2")
```

## Summary

| **Feature**           | **Details**                                         |
|-----------------------|-----------------------------------------------------|
| **Base Model**        | `google/mt5-small`                                  |
| **Task**              | Summarization                                       |
| **Language**          | Khmer (ខ្មែរ)                                        |
| **Dataset**           | `kimleang123/rfi_news` + previous dataset           |
| **Framework**         | Hugging Face Transformers                           |
| **Evaluation Metric** | ROUGE Score                                         |
| **Deployment**        | Hugging Face Model Hub, API (FastAPI), Python Code  |

## Contributing

Contributions are welcome! Feel free to **open issues or submit pull requests** if you have any improvements or suggestions.

### Contact

If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

**Built for the Khmer NLP Community**