---
license: apache-2.0
language:
- en
- la
base_model:
- google/mt5-small
---
Demonstration of fine-tuning mt5-small for C17th English (and Latin) legal depositions.
Uses mt5-small, which is pre-trained on the mC4 common crawl dataset covering 101 languages, including some Latin.
mt5-small is the smallest of the five mT5 variants (small; base; large; XL; XXL).
Fine-tuned with text-to-text pairs of raw-HTR and hand-corrected Ground Truth from C17th English High Court of Admiralty depositions.
A series of fine-tuned mt5-small models will be created with ascending version numbers.
Training dataset = 80%; validation dataset = 20% (a split sketch is given below).
MT5Tokenizer.
PyTorch datasets.
MT5ForConditionalGeneration model.
CER/WER evaluation (a sketch follows trainer.train() below); qualitative evaluation (e.g. capitalisation; HTR error correction).
Trained using an Nvidia T4 small GPU (15 GB, $0.40/hour).
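TRAIN/VALIDATION SPLIT
A minimal sketch of the 80/20 split, using scikit-learn's train_test_split and assuming the raw-HTR lines and their hand-corrected Ground Truth counterparts sit in two parallel Python lists (raw_htr_lines and ground_truth_lines are illustrative names, not part of the released code).
```python
from sklearn.model_selection import train_test_split

# Placeholder parallel lists: one raw-HTR line per hand-corrected Ground Truth line.
raw_htr_lines = ["raw HTR line 1", "raw HTR line 2", "raw HTR line 3", "raw HTR line 4", "raw HTR line 5"]
ground_truth_lines = ["GT line 1", "GT line 2", "GT line 3", "GT line 4", "GT line 5"]

train_inputs, val_inputs, train_targets, val_targets = train_test_split(
    raw_htr_lines,
    ground_truth_lines,
    test_size=0.2,     # 80% training / 20% validation
    random_state=42,   # assumed seed, for reproducibility
)
```
The resulting train_inputs, train_targets, val_inputs and val_targets are the variables consumed by the tokenisation step below.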
MT5TOKENIZER
```python
from transformers import MT5Tokenizer

# mT5 checkpoints ship an MT5 SentencePiece tokenizer
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
```
TOKENIZE DATA
```python
train_encodings = tokenizer(list(train_inputs), text_target=list(train_targets), truncation=True, padding=True)
val_encodings = tokenizer(list(val_inputs), text_target=list(val_targets), truncation=True, padding=True)
```
CREATE PYTORCH DATASETS
```python
import torch

class HTRDataset(torch.utils.data.Dataset):
    """Wraps the tokenised raw-HTR/Ground-Truth pairs for the Trainer."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # input_ids, attention_mask and labels for one line pair
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = HTRDataset(train_encodings)
val_dataset = HTRDataset(val_encodings)
```
FINE-TUNING WITH TRANSFORMERS
```python
from transformers import MT5ForConditionalGeneration

# MT5ForConditionalGeneration matches the mt5 model type of this checkpoint
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
```
TRAINING ARGUMENTS:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,   # or 16 if your GPU has enough memory
    per_device_eval_batch_size=8,    # same as train batch size
    learning_rate=1e-4,
    num_train_epochs=3,              # or 5
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,                       # if your GPU supports it, for faster training
    # ... other arguments ...
)
```
EARLY STOPPING:
```python
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    # ... other arguments ...
    evaluation_strategy="epoch",
    save_strategy="epoch",               # must match evaluation_strategy when load_best_model_at_end=True
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Early stopping is set as a Trainer callback, not a TrainingArguments field:
# pass callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] when creating the Trainer (optional).
```
CREATE TRAINER AND FINE-TUNE
```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # optional, see EARLY STOPPING above
)
trainer.train()
```
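CER/WER EVALUATION
A minimal evaluation sketch for the CER/WER check described above, using the Hugging Face evaluate library (cer and wer metrics, which require jiwer). The generation loop and settings here are assumptions, not the project's published evaluation code.
```python
import evaluate
import torch

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

model.eval()
device = next(model.parameters()).device

predictions = []
with torch.no_grad():
    for raw_line in val_inputs:
        inputs = tokenizer(raw_line, return_tensors="pt", truncation=True).to(device)
        output_ids = model.generate(**inputs, max_new_tokens=128)  # assumed generation length
        predictions.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

print("CER:", cer_metric.compute(predictions=predictions, references=list(val_targets)))
print("WER:", wer_metric.compute(predictions=predictions, references=list(val_targets)))
```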
---
Fine-tuning data experiments will include:
* Using 1000 lines of raw-HTR paired with 1000 lines of hand-corrected Ground Truth
* Using 2000 lines of raw-HTR paired with 1000 lines of hand-corrected Ground Truth
* Using 1000 and 2000 lines of synthetic raw-HTR paired with 1000 lines of hand-corrected Ground Truth
---
Hyper-parameter experiments will include (a looping sketch follows this list):
* Adjusting batch size from 8 paired-lines to 16 paired-lines
* Adjusting epochs from 3 to 5 epochs
* Adjusting learning rate
** Start with a learning rate of 1e-4 (0.0001). This is a common starting point for fine-tuning transformer models.
** Experiment with slightly higher or lower values (e.g., 5e-4 or 5e-5) in later experiments.
* Adjusting early-stopping settings
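A minimal sweep sketch for the experiments above, rebuilding TrainingArguments and the Trainer for each run. The learning rates, batch sizes and epoch counts come from the list; the output paths, early-stopping patience and fresh-model-per-run choice are assumptions.
```python
from transformers import EarlyStoppingCallback, MT5ForConditionalGeneration, Trainer, TrainingArguments

# Grid drawn from the hyper-parameter list above; paths and patience are illustrative.
for lr in (1e-4, 5e-4, 5e-5):
    for batch_size in (8, 16):
        for epochs in (3, 5):
            run_args = TrainingArguments(
                output_dir=f"./results/lr{lr}_bs{batch_size}_ep{epochs}",
                learning_rate=lr,
                per_device_train_batch_size=batch_size,
                per_device_eval_batch_size=batch_size,
                num_train_epochs=epochs,
                evaluation_strategy="epoch",
                save_strategy="epoch",
                load_best_model_at_end=True,
                metric_for_best_model="eval_loss",
                fp16=True,
            )
            run_trainer = Trainer(
                model=MT5ForConditionalGeneration.from_pretrained("google/mt5-small"),  # fresh model per run
                args=run_args,
                train_dataset=train_dataset,
                eval_dataset=val_dataset,
                callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
            )
            run_trainer.train()
```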