|
--- |
|
tags: |
|
- model |
|
- checkpoints |
|
- translation |
|
- latin |
|
- english |
|
- mt5 |
|
- mistral |
|
- multilingual |
|
- NLP |
|
language: |
|
- en |
|
- la |
|
license: "cc-by-4.0" |
|
models: |
|
- mistralai/Mistral-7B-Instruct-v0.3 |
|
- google/mt5-small |
|
model_type: "mt5-small" |
|
training_epochs: "6 (initial pipeline), 30 (final pipeline with optimizations), 100 (fine-tuning on 4750 summaries)"
|
task_categories: |
|
- translation |
|
- summarization |
|
- multilingual-nlp |
|
task_ids: |
|
- en-la-translation |
|
- la-en-translation |
|
- text-generation |
|
pretty_name: "mT5-LatinSummarizerModel" |
|
storage: |
|
- git-lfs |
|
- huggingface-models |
|
size_categories: |
|
- 5GB<n<10GB |
|
--- |
|
# **mT5-LatinSummarizerModel: Fine-Tuned Model for Latin NLP** |
|
|
|
[GitHub Repository](https://github.com/AxelDlv00/LatinSummarizer)

[Model on Hugging Face](https://huggingface.co/LatinNLP/LatinSummarizerModel)

[Dataset on Hugging Face](https://huggingface.co/datasets/LatinNLP/LatinSummarizerDataset)
|
|
|
## **Overview** |
|
This repository contains the **trained checkpoints and tokenizer files** for the `mT5-LatinSummarizerModel`, which was fine-tuned to improve **Latin summarization and translation**. It is designed to: |
|
- Translate between **English and Latin**. |
|
- Summarize Latin texts effectively. |
|
- Leverage extractive and abstractive summarization techniques. |
|
- Utilize **curriculum learning** for improved training. |
|
|
|
## **Installation & Usage** |
|
To download and set up the models (mT5-small and Mistral-7B-Instruct), you can directly run: |
|
```bash |
|
bash install_large_models.sh |
|
``` |
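
Once downloaded, the checkpoints can be loaded with Hugging Face Transformers. The snippet below is a minimal sketch, not the project's official inference script: it assumes the fine-tuned weights are loadable from the `LatinNLP/LatinSummarizerModel` repository (or from one of the checkpoint folders listed under *Project Structure*), and the task prefix shown is an illustrative assumption rather than the exact prompt format used during training.

```python
# Minimal inference sketch. Assumptions: the repo id / local folder holds a full
# mT5 checkpoint, and the task prefix matches the one used during fine-tuning.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "LatinNLP/LatinSummarizerModel"  # or a local folder, e.g. "final_pipeline/no_stanza"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Hypothetical prompt format; adjust to the prefixes actually used at training time.
text = "translate English to Latin: The senate convenes at dawn."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```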
|
|
|
## **Project Structure** |
|
``` |
|
.
├── final_pipeline              (trained for 30 light epochs with optimizations, then fine-tuned for 100 epochs on the small high-quality summaries dataset)
│   ├── no_stanza
│   └── with_stanza
├── initial_pipeline            (trained for 6 epochs without optimizations)
│   └── mt5-small-en-la-translation-epoch5
├── install_large_models.sh
└── README.md
|
``` |
|
|
|
## **Training Methodology** |
|
We fine-tuned **mT5-small** in three phases: |
|
1. **Initial Training Pipeline (6 epochs)**: Used the full dataset without optimizations. |
|
2. **Final Training Pipeline (30 light epochs)**: Used **10% of the training data per epoch** for efficiency (see the sampling sketch after this list).
|
3. **Fine-Tuning (100 epochs)**: Focused on the **4750 high-quality summaries** for final optimization. |
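
The light epochs of phase 2 amount to drawing a fresh ~10% sample of the training set at every epoch. The sketch below only illustrates that sampling logic; the toy `Dataset` stands in for the real training data from `LatinSummarizerDataset`, and the actual pipeline may sample differently.

```python
# Sketch of "light epochs": a fresh ~10% sample of the training set per epoch.
# The toy Dataset below is a stand-in for the real training data (an assumption).
from datasets import Dataset

train_ds = Dataset.from_dict({"text": [f"exemplum {i}" for i in range(1000)]})

NUM_LIGHT_EPOCHS = 30
SAMPLE_FRACTION = 0.10

for epoch in range(NUM_LIGHT_EPOCHS):
    subset = train_ds.shuffle(seed=epoch).select(range(int(SAMPLE_FRACTION * len(train_ds))))
    # ... run one training pass over `subset` here ...
    print(f"epoch {epoch}: {len(subset)} examples")
```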
|
|
|
### **Training Configurations:**
|
- **Hardware:** 16GB VRAM GPU (lab machines via SSH). |
|
- **Batch Size:** Adaptive due to GPU memory constraints. |
|
- **Gradient Accumulation:** Enabled for larger effective batch sizes. |
|
- **LoRA-based fine-tuning:** LoRA rank 8, scaling factor 32 (see the configuration sketch after this list).
|
- **Dynamic Sequence Length Adjustment:** Increased progressively. |
|
- **Learning Rate:** `5 × 10^-4` with warm-up steps.
|
- **Checkpointing:** Frequent saves to mitigate power outages. |
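
The LoRA and optimizer settings above could be expressed with the `peft` library and `Seq2SeqTrainingArguments` roughly as follows. Only the LoRA rank (8), scaling factor (32), and learning rate come from the list above; the target modules, dropout, warm-up steps, batch size, accumulation factor, and save interval are assumptions, not the exact values used.

```python
# Sketch of the LoRA + optimizer setup; values not listed above are assumptions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # LoRA rank (from the configuration above)
    lora_alpha=32,              # scaling factor (from the configuration above)
    lora_dropout=0.05,          # assumption
    target_modules=["q", "v"],  # typical T5/mT5 attention projections; an assumption
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    learning_rate=5e-4,              # from the configuration above
    warmup_steps=500,                # warm-up is used; this step count is an assumption
    gradient_accumulation_steps=4,   # accumulation enabled; the factor is an assumption
    per_device_train_batch_size=8,   # adaptive in practice; placeholder value
    save_steps=500,                  # frequent checkpointing; interval is an assumption
)
```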
|
|
|
## **Evaluation & Results** |
|
We evaluated the model using **ROUGE, BERTScore, and BLEU/chrF scores**. |
|
|
|
| Metric       | Before Fine-Tuning | After Fine-Tuning |
|--------------|--------------------|-------------------|
| ROUGE-1      | 0.1675             | 0.2541            |
| ROUGE-2      | 0.0427             | 0.0773            |
| ROUGE-L      | 0.1459             | 0.2139            |
| BERTScore-F1 | 0.6573             | 0.7140            |
|
|
|
- **chrF Score (en→la):** 33.60 (with Stanza tags) vs 18.03 BLEU (without Stanza).
|
- **Summarization Density:** Maintained at ~6%. |
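
For reference, metrics of this kind can be recomputed with the `evaluate` library. This is a generic sketch rather than the exact evaluation script behind the numbers above; `predictions` and `references` are placeholders, and the multilingual BERT backbone for BERTScore is an assumption.

```python
# Generic metric sketch with the `evaluate` library; placeholder data, not the real eval set.
import evaluate

predictions = ["senatus prima luce convenit"]   # model outputs (placeholder)
references = ["senatus prima luce convocatur"]  # gold texts (placeholder)

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
chrf = evaluate.load("chrf")

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references,
                        model_type="bert-base-multilingual-cased"))
print(chrf.compute(predictions=predictions, references=[[r] for r in references]))
```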
|
|
|
### **Observations:** |
|
- Pre-training on **extractive summaries** was crucial. |
|
- The model still exhibits some **excessive extraction**, indicating room for further improvement.
|
|
|
## **License** |
|
This model is released under **CC-BY-4.0**. |
|
|
|
## **Citation** |
|
```bibtex |
|
@misc{LatinSummarizerModel, |
|
author = {Axel Delaval and Elsa Lubek},
|
title = {Latin-English Summarization Model (mT5)}, |
|
year = {2025}, |
|
url = {https://huggingface.co/LatinNLP/LatinSummarizerModel} |
|
} |
|
``` |