---
license: apache-2.0
language:
- en
- la
base_model:
- google/mt5-small
---
Demonstration of fine-tuning mt5-small for C17th English (and Latin) legal depositions.
Uses mt5-small, which is pre-trained on the mC4 Common Crawl dataset covering 101 languages, including some Latin.
mt5-small is the smallest of five variants of mt5 (small; base; large; XL; XXL).
Fine-tuned with text-to-text pairs of raw-HTR and hand-corrected Ground Truth from C17th English High Court of Admiralty depositions.

A series of fine-tuned mt5-small models will be created with ascending version numbers.

Pipeline overview:

* Training dataset = 80%; validation dataset = 20% (see the data-preparation sketch below)
* T5Tokenizer loaded from the mT5 checkpoint
* PyTorch datasets
* T5ForConditionalGeneration model
* CER/WER evaluation; qualitative evaluation (e.g. capitalisation; HTR error correction)
* Training on an Nvidia T4 small (15 GB), $0.40/hour
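
DATA PREPARATION
The tokenisation step below expects four sequences: train_inputs, train_targets, val_inputs and val_targets. The following is a minimal sketch of one way to produce them with the 80/20 split described above; the CSV file name and column names are illustrative assumptions, not files in this repository.
Python

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file of paired lines: one raw-HTR line and its hand-corrected
# Ground Truth per row (the real file name and column names may differ)
df = pd.read_csv("hca_deposition_pairs.csv")

train_inputs, val_inputs, train_targets, val_targets = train_test_split(
    df["raw_htr"],
    df["ground_truth"],
    test_size=0.2,      # 80% training / 20% validation
    random_state=42,    # fixed seed so the split is reproducible
)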

MT5TOKENIZER
Python

from transformers import T5Tokenizer

# Load the sentencepiece tokenizer that ships with the mT5 checkpoint
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")

TOKENIZE DATA
Python

# Tokenise the raw-HTR inputs; text_target= also encodes the Ground Truth
# lines and stores them under the "labels" key used during training
train_encodings = tokenizer(list(train_inputs), text_target=list(train_targets), truncation=True, padding=True)
val_encodings = tokenizer(list(val_inputs), text_target=list(val_targets), truncation=True, padding=True)

CREATE PYTORCH DATASETS
Python

import torch

class HTRDataset(torch.utils.data.Dataset):
    """Wraps the tokenised encodings (input_ids, attention_mask, labels) for the Trainer."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Return one example as tensors, one entry per encoding key
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)


train_dataset = HTRDataset(train_encodings)
val_dataset = HTRDataset(val_encodings)
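
A quick, purely illustrative sanity check of one encoded example:
Python

# The first training example should expose input_ids, attention_mask and labels
example = train_dataset[0]
print(len(train_dataset), {key: tensor.shape for key, tensor in example.items()})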

FINE-TUNING WITH TRANSFORMERS
Python

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/mt5-small")

TRAINING ARGUMENTS
Python

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,  # or 16 if your GPU has enough memory
    per_device_eval_batch_size=8,   # same as the train batch size
    learning_rate=1e-4,
    num_train_epochs=3,             # or 5
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,                      # if your GPU supports it, for faster training
    # ... other arguments ...
)

EARLY STOPPING
Python

from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    # ... other arguments ...
    evaluation_strategy="epoch",
    save_strategy="epoch",              # must match evaluation_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Early stopping is a Trainer callback, not a TrainingArguments option;
# it is passed via callbacks= when creating the Trainer (see below)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # optional

CREATE TRAINER AND FINE-TUNE
Python

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[early_stopping],  # optional: remove if not using early stopping
)

trainer.train()
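
CER/WER EVALUATION
The overview above lists CER/WER evaluation; the following is a minimal sketch of one way to score the fine-tuned model on the validation lines. It assumes the jiwer package (not a stated dependency of this repository) and uses illustrative, untuned generation settings.
Python

import jiwer

model.eval()
predictions = []
for raw_line in list(val_inputs):
    # Encode one raw-HTR line and generate the model's corrected version
    inputs = tokenizer(raw_line, return_tensors="pt", truncation=True).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    predictions.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Compare against the hand-corrected Ground Truth
references = list(val_targets)
print("CER:", jiwer.cer(references, predictions))
print("WER:", jiwer.wer(references, predictions))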

---
Fine-tuning data experiments will include:

* Using 1000 lines of raw-HTR paired with 1000 lines of hand-corrected Ground Truth
* Using 2000 lines of raw-HTR paired with 1000 lines of hand-corrected Ground Truth
* Using 1000 and 2000 lines of synthetic raw-HTR paired with 1000 lines of hand-corrected Ground Truth (one possible way to generate synthetic raw-HTR is sketched below)
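
How the synthetic raw-HTR will be produced is not specified here; the sketch below shows one common approach, injecting character-level noise into Ground Truth lines. The noise rates and alphabet are illustrative assumptions, not values used for this model.
Python

import random

def add_htr_noise(line, error_rate=0.05, alphabet="abcdefghijklmnopqrstuvwxyz "):
    # Hypothetical helper: randomly delete, substitute or insert characters
    # to mimic raw-HTR transcription errors
    noisy = []
    for ch in line:
        r = random.random()
        if r < error_rate / 3:
            continue                                   # deletion
        elif r < 2 * error_rate / 3:
            noisy.append(random.choice(alphabet))      # substitution
        elif r < error_rate:
            noisy.append(ch)
            noisy.append(random.choice(alphabet))      # insertion
        else:
            noisy.append(ch)
    return "".join(noisy)

synthetic_inputs = [add_htr_noise(line) for line in train_targets]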
---
Hyper-parameter experiments will include:

* Adjusting batch size from 8 paired-lines to 16 paired-lines
* Adjusting epochs from 3 to 5 epochs
* Adjusting learning rate
  * Start with a learning rate of 1e-4 (0.0001). This is a common starting point for fine-tuning transformer models.
  * Experiment with slightly higher or lower values (e.g., 5e-4 or 5e-5) in later experiments
* Adjusting early-stopping settings