	---
	license: apache-2.0
	datasets:
	- kimleang123/rfi_news
	language:
	- km
	metrics:
	- rouge
	base_model:
	- google/mt5-small
	pipeline_tag: summarization
	library_name: transformers
	---
	# Khmer mT5 Summarization Model (1024 Tokens)

	## Introduction

	This repository contains an mT5 model fine-tuned for Khmer text summarization. It extends the original [khmer-mt5-summarization](https://huggingface.co/songhieng/khmer-mt5-summarization) model to handle longer texts: training was adjusted to accept inputs of up to 1024 tokens.

	## Model Details

	- Base Model: `google/mt5-small`
	- Fine-tuned for: Khmer text summarization with extended input length
	- Training Dataset: `kimleang123/khmer-text-dataset`
	- Framework: Hugging Face `transformers`
	- Task Type: Sequence-to-Sequence (Seq2Seq)
	- Input: Khmer text (articles, paragraphs, or documents) up to 1024 tokens
	- Output: Summarized Khmer text
	- Training Hardware: GPU (Tesla T4)
	- Evaluation Metric: ROUGE Score

	## Installation & Setup

	### 1️⃣ Install Dependencies

	Ensure you have `transformers`, `torch`, and `datasets` installed:

	```bash
	pip install transformers torch datasets
	```

	### 2️⃣ Load the Model

	To load and use the fine-tuned model:

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	model_name = "songhieng/khmer-mt5-summarization-1024tk"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
	```

	## How to Use

	### 1️⃣ Using Python Code

	```python
	def summarize_khmer(text, max_length=150):
	    input_text = f"summarize: {text}"
	    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)
	    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
	    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
	    return summary

	khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
	summary = summarize_khmer(khmer_text)
	print("Khmer Summary:", summary)
	```
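
Because the tokenizer call uses `truncation=True` with `max_length=1024`, anything past the first 1024 tokens is dropped before summarization. One possible workaround for longer documents is to split the token IDs into windows that each fit the limit and summarize the windows separately. The sketch below shows only the splitting step. The helper name `chunk_token_ids`, its `overlap` parameter, and the default budget of 1000 tokens are assumptions made for illustration; the model was not necessarily trained or evaluated on chunked inputs.

```python
def chunk_token_ids(token_ids, max_tokens=1000, overlap=50):
    """Split a list of token IDs into windows of at most `max_tokens`.

    Consecutive windows share `overlap` tokens, so text cut at a window
    boundary also appears at the start of the next window.
    (Hypothetical helper, not part of the model repository.)
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks


# Toy example with integers standing in for token IDs.
print(chunk_token_ids(list(range(10)), max_tokens=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the real tokenizer, the IDs could come from `tokenizer(text, add_special_tokens=False)["input_ids"]`, and each window could be turned back into text with `tokenizer.decode(chunk, skip_special_tokens=True)` before being passed to `summarize_khmer`. The default of 1000 rather than 1024 leaves headroom for the `summarize: ` prefix and the end-of-sequence token, since re-tokenizing decoded text may not give exactly the same token count.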

	### 2️⃣ Using Hugging Face Pipeline

	For a simpler approach:

	```python
	from transformers import pipeline

	summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization-1024tk")
	khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
	summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
	print("Khmer Summary:", summary[0]['summary_text'])
	```

	### 3️⃣ Deploy as an API using FastAPI

	You can create a simple API for summarization:

	```python
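	from fastapi import FastAPI
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	# Load the model once at startup so every request reuses it.
	model_name = "songhieng/khmer-mt5-summarization-1024tk"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

	app = FastAPI()

	@app.post("/summarize/")
	def summarize(text: str):
	    # `text` is a plain `str` parameter, so FastAPI reads it from the query string.
	    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=1024)
	    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
	    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
	    return {"summary": summary}

	# Run with: uvicorn filename:app --reload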
	```
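
Since the endpoint declares `text` as a plain `str`, FastAPI expects it as a query parameter, which means a client has to percent-encode the Khmer text into the URL. The following is a minimal client-side sketch using only the standard library. The helper name `build_summarize_request` and the base URL `http://127.0.0.1:8000` (uvicorn's default address) are assumptions; adjust them to your deployment.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request


def build_summarize_request(base_url, text):
    """Build (but do not send) a POST request for the /summarize/ endpoint.

    Hypothetical helper: `base_url` is wherever the API is served.
    """
    query = urlencode({"text": text})  # percent-encodes the UTF-8 Khmer text
    return Request(f"{base_url}/summarize/?{query}", method="POST")


req = build_summarize_request("http://127.0.0.1:8000", "កម្ពុជា")
print(req.get_method(), req.full_url)

# Round-trip check: decoding the query string recovers the original text.
decoded = parse_qs(urlsplit(req.full_url).query)["text"][0]
print(decoded == "កម្ពុជា")
```

With the server running, the request can be sent with `urllib.request.urlopen(req)` and the JSON body read from the response. For long articles a JSON request body is usually a better fit than a query string, but that would require changing the endpoint signature.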

	## Model Evaluation

	The model was evaluated using ROUGE scores, which measure the similarity between the generated summaries and the reference summaries.

	```python
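	# `datasets.load_metric` has been deprecated and removed in recent `datasets` releases;
	# use the `evaluate` library instead (pip install evaluate rouge_score).
	import evaluate
	import numpy as np

	rouge = evaluate.load("rouge")

	def compute_metrics(pred):
	    labels_ids = pred.label_ids
	    pred_ids = pred.predictions
	    # Labels padded with -100 (ignored by the loss) cannot be decoded; swap in the pad token ID.
	    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
	    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
	    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
	    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

	# `trainer` is assumed to be the trainer instance from fine-tuning, created with compute_metrics=compute_metrics.
	trainer.evaluate()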
	```
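
To make concrete what ROUGE measures, here is a simplified sketch of ROUGE-1 F1, the unigram-overlap variant, written from scratch. It is an illustration only and not the implementation used to evaluate this model. The function name `rouge1_f1` is hypothetical, and it takes pre-split token lists because Khmer is not whitespace-delimited, so the choice of tokenization (and therefore the score) is left to the caller.

```python
from collections import Counter


def rouge1_f1(pred_tokens, ref_tokens):
    """ROUGE-1 F1 between two token lists, using clipped unigram overlap."""
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, per token, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens found in the reference
    recall = overlap / len(ref_tokens)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


# Toy example with placeholder tokens (not real model output).
pred = ["a", "b", "c", "d"]
ref = ["a", "b", "e"]
print(round(rouge1_f1(pred, ref), 3))  # precision 2/4, recall 2/3 → 0.571
```

Library implementations additionally report ROUGE-2 and ROUGE-L and apply their own tokenization, so their numbers will generally differ from this toy function.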

	## Saving & Uploading the Model

	After fine-tuning, the model can be uploaded to the Hugging Face Hub:

	```python
	model.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
	tokenizer.push_to_hub("songhieng/khmer-mt5-summarization-1024tk")
	```

	To download it later:

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
	tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization-1024tk")
	```

	## Summary

	| Feature | Details |
	|-----------------------|-------------------------------------------------|
	| Base Model | `google/mt5-small` |
	| Task | Summarization |
	| Language | Khmer (ខ្មែរ) |
	| Dataset | `kimleang123/khmer-text-dataset` |
	| Framework | Hugging Face Transformers |
	| Evaluation Metric | ROUGE Score |
	| Deployment | Hugging Face Model Hub, API (FastAPI), Python Code |

	## Contributing

	Contributions are welcome! Feel free to open issues or submit pull requests if you have any improvements or suggestions.

	### Contact

	If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

	Built for the Khmer NLP Community