# Replaced with v1.4.2 at https://huggingface.co/MartialTerran/Toy_GPTs_LLMs_for_CPU_Educational/blob/main/Gettysburg_GPT2_v1.4.2.py
# This script runs and drives the loss below 0.001 by epoch 101; after epoch 110 the loss rises again, and around epoch 150 it trends downward once more. The next version will report the particular words that are causing the error/loss.
#
# The tokenize method now uses the last special token in the self.special_tokens list (assumed here to be the padding token) as the default token for unknown words.
# The separate_punctuation step focuses solely on separating the defined punctuation marks from words.
# Carriage returns [untested] are treated as a distinct case and are replaced with their token after the punctuation-separation step.
# The detokenizer does not yet automatically remove spaces preceding punctuation. This is because tokens are defined without leading spaces, and the detokenizer appends a space to every token.
# It is possible to increase training_input_seq_len over epochs. However, directly modifying training_input_seq_len inside the Dataset class after it is created is not ideal; a better approach is to control the sequence length during batch creation in the DataLoader, e.g. with a custom collate_fn (see the sketch below).
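# A minimal sketch of that collate_fn idea (not part of this script): it assumes the Dataset
# yields (input_ids, target_ids) tensor pairs; make_collate_fn, pad_token_id, and the growth
# schedule below are illustrative names/choices, not the script's actual API.

import torch
from torch.utils.data import DataLoader

def make_collate_fn(seq_len, pad_token_id=0):
    """Build a collate_fn that truncates or pads each (input, target) pair to seq_len."""
    def collate(batch):
        inputs, targets = [], []
        for inp, tgt in batch:
            # Truncate to the current curriculum length.
            inp, tgt = inp[:seq_len], tgt[:seq_len]
            # Pad short sequences up to seq_len with the padding token.
            pad = seq_len - inp.size(0)
            if pad > 0:
                inp = torch.cat([inp, torch.full((pad,), pad_token_id, dtype=inp.dtype)])
                tgt = torch.cat([tgt, torch.full((pad,), pad_token_id, dtype=tgt.dtype)])
            inputs.append(inp)
            targets.append(tgt)
        return torch.stack(inputs), torch.stack(targets)
    return collate

# Illustrative usage: rebuild the DataLoader each epoch with a longer sequence length,
# leaving the Dataset itself untouched.
# for epoch in range(num_epochs):
#     cur_len = min(8 + epoch, max_seq_len)   # hypothetical growth schedule
#     loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
#                         collate_fn=make_collate_fn(cur_len, pad_token_id))
#     train_one_epoch(model, loader)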