MartialTerran committed on
Commit
c8f2523
1 Parent(s): aa29bb9

Update Gettysburg_GPT2_v1.4.2.py

  1. Gettysburg_GPT2_v1.4.2.py +492 -0
Gettysburg_GPT2_v1.4.2.py CHANGED
@@ -0,0 +1,492 @@
+ # Note: the "positional encoding" used in this toy model is a PyTorch-supported learned positional embedding, not the fixed sinusoidal encoding:
+ #     self.position_embedding_table = nn.Embedding(config["max_sequence_len"], config["n_embd"])
+ # It is functional for memorizing the sequence of tokens in the Dataset, but it is not the sinusoidal encoding of the original Transformer. (GPT-2 itself uses learned positional embeddings, so in this respect the toy model matches GPT-2.)
+
+ """
+ To implement a sinusoidal positional encoding, you can use:
+
+ class ToyGPT2(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         # (other layers as in the ToyGPT2 class below)
+         self.sinusoidal_embedding_table = self._create_sinusoidal_embeddings(config["max_sequence_len"], config["n_embd"])
+
+     def _create_sinusoidal_embeddings(self, max_len, dim):
+         position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
+         div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))  # (dim/2,)
+         pe = torch.zeros(max_len, dim)
+         pe[:, 0::2] = torch.sin(position * div_term)  # even columns
+         pe[:, 1::2] = torch.cos(position * div_term)  # odd columns (dim must be even)
+         pe = pe.unsqueeze(0)  # Add a batch dimension (size 1) -> (1, max_len, dim)
+         return pe
+         #return pe.to(device)  # Alternative: move the table to the device here
+
+     def forward(self, idx, targets=None):
+         B, T = idx.shape
+         tok_emb = self.token_embedding_table(idx)  # (B,T,C)
+         #positional_embeddings = learned_embeddings
+         sinusoidal_embeddings = self.sinusoidal_embedding_table[:, :T, :].to(idx.device)  # (1,T,C)
+         positional_embeddings = sinusoidal_embeddings
+         x = tok_emb + positional_embeddings  # (B,T,C), broadcast over the batch dimension
+ """
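The sinusoidal formula in the note above can be checked by hand without PyTorch. The following is a minimal framework-free sketch, not part of the committed script; the helper name `sinusoidal_table` is hypothetical, and the sizes 340 and 4 are simply the script's `max_sequence_len` and `n_embd` values.

```python
import math

def sinusoidal_table(max_len, dim):
    """Return a max_len x dim nested list of sinusoidal position encodings."""
    if dim % 2 != 0:
        raise ValueError("dim must be even so sin/cos columns pair up")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, dim, 2):
            # Same frequency as exp(i * -ln(10000) / dim) in the torch version
            angle = pos / (10000.0 ** (i / dim))
            row.append(math.sin(angle))  # even column
            row.append(math.cos(angle))  # odd column
        table.append(row)
    return table

pe = sinusoidal_table(max_len=340, dim=4)
print(pe[0])  # position 0: sin(0) and cos(0) in alternating columns
print(pe[1])
```

Because the table depends only on `max_len` and `dim`, it carries no trainable parameters, which is the practical difference from the `nn.Embedding` table the script uses.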
+
+ # Added in v1.4.2.py
+
+ # Dynamic vocab_size update: after the Tokenizer is created in main(), the calculated tokenizer.vocab_size is assigned back to the hyperparameters:
+ #     hyperparameters["vocab_size"] = tokenizer.vocab_size
+ # This guarantees that the ToyGPT2 model is constructed, before training, with the correct vocabulary size derived from the Dataset (Gettysburg Address) plus special tokens.
+ # It helps prevent runtime errors when the user changes the Dataset in a way that changes the number of tokens in the model vocabulary.
+ # [If hyperparameters["vocab_size"] is larger than the number of Tokenizer-defined tokens, the as-initialized output layer might select an undefined token as the next token and then produce an out-of-range crash.]
+ # Checkpoints (when enabled within Trainer.train) are saved with the updated hyperparameters["vocab_size"].
+ # A future feature, not yet implemented, should automatically write hyperparameters["vocab_size"] into the filename of the saved checkpoint.
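The filename feature mentioned in the note above is not implemented in this commit. One possible shape for it is sketched below; the function name `checkpoint_filename` and the `_vocab{N}` naming scheme are assumptions for illustration, loosely modeled on how `save_checkpoint` already appends a timestamp.

```python
import datetime
import os

def checkpoint_filename(path, vocab_size, when=None):
    """Build a checkpoint filename that records the vocabulary size (hypothetical scheme)."""
    when = when or datetime.datetime.now()
    base, ext = os.path.splitext(path)           # e.g. ("model_checkpoint_early_stop", ".pth")
    stamp = when.strftime("%Y-%m-%d_%H-%M-%S")   # same timestamp format the script uses
    return f"{base}_vocab{vocab_size}_{stamp}{ext}"

name = checkpoint_filename(
    "model_checkpoint_early_stop.pth",
    vocab_size=152,
    when=datetime.datetime(2024, 1, 2, 3, 4, 5),  # fixed time so the result is reproducible
)
print(name)
```

Encoding the size in the name would let a loader reject a checkpoint whose vocabulary does not match the current Tokenizer before any weights are read.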
+
+ # Bug fix: learning-rate access error, encountered on some CUDA computers:
+ #   added:     last_lr = self.optimizer.param_groups[0]['lr']  # Get the learning rate from the optimizer
+ #   replacing: last_lr = self.scheduler.get_last_lr()[0]       # Get the last learning rate
+
+ # Note: if the token length of the training Dataset is changed (even if hyperparameters["vocab_size"] is not changed), the values of "max_sequence_len" and min_training_input_seq_len may need to be changed to avoid a crash. (This note was written with ["max_sequence_len": 264] and [min_training_input_seq_len = 32]; this version sets 340 and 300.)
+ # Generally, the Dataset has to have at least about 20 more tokens than min_training_input_seq_len. This comes from how the Dataset class slices the token list into input/target windows, which the DataLoader then groups into batches: the window loop needs at least seq_len + 2 tokens to produce a single sample.
+ # It is fastest and probably most efficient to train the model with ["batch_size": 1] and only provide enough tokens in the dataset to define one unique batch?
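To see how the token count and the sequence length interact, the window loop from the Dataset class, `range(0, len(self.tokens) - seq_len - 1, seq_len)`, can be evaluated on its own. This is a small sketch with made-up token counts (they are not measured from the script); the helper names are hypothetical.

```python
def window_starts(n_tokens, seq_len):
    """Start indices produced by the Dataset loop: range(0, n_tokens - seq_len - 1, seq_len)."""
    return list(range(0, n_tokens - seq_len - 1, seq_len))

def batches_per_epoch(n_tokens, seq_len, batch_size):
    """Number of batches a DataLoader would yield (default drop_last=False)."""
    n_samples = len(window_starts(n_tokens, seq_len))
    return -(-n_samples // batch_size)  # ceiling division

# Illustrative token counts only
for n_tokens in (301, 302, 700):
    print(n_tokens, window_starts(n_tokens, 300), batches_per_epoch(n_tokens, 300, 16))
```

With zero samples the loader is empty, so the `total_loss / len(self.train_loader)` step in the training loop would divide by zero; checking `len(dataset)` before training is a cheap guard.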
+
+ # The "max_sequence_len" hyperparameter (264 when this note was written) only truncates the token sequence fed to the model as context. The model can still output a larger number of tokens in its whole response: generate() keeps appending one token per step for max_new_tokens steps and crops only the context window, so the returned sequence may be longer than max_sequence_len.
+ # An <end of doc> special token has not been defined, so generation never stops early.
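The interaction between a fixed context window and an unbounded output can be shown with a toy loop. The sketch below is model-free: `toy_next_token` is a stand-in for the network, and the `stop_id` argument illustrates what an end-of-document token could do if one were defined (none exists in this script).

```python
def generate_ids(next_token_fn, prompt_ids, max_new_tokens, max_context, stop_id=None):
    """Append up to max_new_tokens tokens; only the last max_context tokens are used as context."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-max_context:]      # crop the context, not the output
        nxt = next_token_fn(context)
        ids.append(nxt)
        if stop_id is not None and nxt == stop_id:
            break                         # hypothetical end-of-document behavior
    return ids

def toy_next_token(context):
    """Stand-in for the model: next id is the previous id plus one, modulo 10."""
    return (context[-1] + 1) % 10

out = generate_ids(toy_next_token, [0, 1], max_new_tokens=20, max_context=4)
stopped = generate_ids(toy_next_token, [0, 1], max_new_tokens=20, max_context=4, stop_id=5)
print(len(out), stopped)
```

The first call returns 22 tokens even though the context never exceeds 4, which mirrors why the script's output can be longer than `max_sequence_len`.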
+
+ # Features of v1.4final.py
+ # This script runs and drives the loss down to under 0.001 at epoch 101; after epoch 110 the loss rises again, and at epoch 150 it goes down again. The next version will report the particular words that are causing the error/loss.
+ # The tokenize method uses the last special token in the self.special_tokens list (assumed to be the padding token <pad>) as the default token for unknown words.
+ # separate_punctuation focuses solely on separating the defined punctuation marks from words.
+
+ # Carriage returns [untested] are treated as a distinct case and are replaced with the <cr> token after the punctuation-separation step.
+ # The detokenizer does not yet auto-remove the spaces preceding punctuation. This is because tokens are defined without leading spaces, and the detokenizer joins all tokens with single spaces.
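A post-processing step for the detokenizer could close the gap before sentence punctuation. The following is one possible sketch, not the script's behavior; the function name `tidy_punctuation` is hypothetical, and it deliberately handles only `.`, `,` and `?`, because a hyphen (as in "battle - field") would need both neighboring spaces removed.

```python
import re

def tidy_punctuation(text, marks=".,?"):
    """Remove whitespace that precedes any of the given punctuation marks."""
    pattern = r"\s+([" + re.escape(marks) + r"])"
    return re.sub(pattern, r"\1", text)

sample = "we are met on a great battle - field of that war . we have come"
print(tidy_punctuation(sample))
```

Applying this to the string returned by `detokenize` leaves the token ids untouched, so it affects display only.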
+
+ # It is possible to increase training_input_seq_len over epochs. However, directly modifying training_input_seq_len inside the Dataset class after it has been created is not ideal. A better approach may be to control the sequence length during batch creation within the DataLoader, for example with a custom collate_fn?
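One way the collate_fn idea could look is sketched below. It is an assumption-laden illustration rather than code from the script: the class name `GrowingLengthCollate` and its `stack` parameter are hypothetical. The callable truncates every (input, target) pair in a batch to the current length; with PyTorch one would pass `stack=torch.stack` and hand the object to `DataLoader(..., collate_fn=...)`.

```python
class GrowingLengthCollate:
    """Collate callable that truncates each (input, target) pair to a length that can change between epochs."""

    def __init__(self, seq_len, stack=list):
        self.seq_len = seq_len
        self.stack = stack  # e.g. torch.stack when the items are tensors

    def set_seq_len(self, seq_len):
        self.seq_len = seq_len

    def __call__(self, batch):
        # Slicing works for lists and for 1-D tensors alike
        inputs = [inp[:self.seq_len] for inp, _ in batch]
        targets = [tgt[:self.seq_len] for _, tgt in batch]
        return self.stack(inputs), self.stack(targets)

collate = GrowingLengthCollate(seq_len=2)
batch = [([1, 2, 3, 4], [2, 3, 4, 5]), ([5, 6, 7, 8], [6, 7, 8, 9])]
short = collate(batch)
collate.set_seq_len(3)   # e.g. called at the start of a later epoch
longer = collate(batch)
print(short, longer)
```

Truncating inputs and targets to the same length keeps them aligned, since each target position is still the token that follows the matching input position.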
+
+
+ print("loading libraries")
+ import os  # to get filename of this script
+ import datetime
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+ from torch.utils.data import Dataset, DataLoader
+ from torch.optim.lr_scheduler import ReduceLROnPlateau  # Import the learning rate scheduler
+ import math
+ import inspect
+ #import string  # replaced with self.punctuation_list = ['.', ',', '/', '\\', '[', ']', '<', '?', '>', '-']  # Specific list of punctuation marks
+ print("done loading libraries")
+
64
+ print("Hardcoding Memorized_Speech = Gettysburg Address") #(for simplicity in this toy example)
65
+ Memorized_Speech1 = """
66
+ Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
67
+
68
+ Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
69
+
70
+ But, in a larger sense, we can not dedicate - we can not consecrate - we can not hallow-this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth.
71
+
72
+ Apple blossom cantaloupe durian elderberry fig guava honeydew iguana kiwi lime mango nectarine orange papaya quince rambutan strawberry tangerine uglier vanilla watermelon xigua yellow yumberry zebra.
73
+ """
74
+ Memorized_Speech = """
75
+ Four score and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war testing whether that nation or any nation so conceived and so dedicated can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But in a larger sense we can not dedicate we can not consecrate we can not hallow-this ground. The brave men living and dead who struggled here have consecrated it far above our poor power to add or detract. The world will little note nor long remember what we say here but it can never forget what they did here. It is for us the living rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion that we here highly resolve that these dead shall not have died in vain that this nation under God shall have a new birth of freedom and that government of the people by the people for the people shall not perish from the earth.
76
+
77
+ Apple blossom cantaloupe durian elderberry fig guava honeydew iguana kiwi lime mango nectarine orange papaya quince rambutan strawberry tangerine uglier vanilla watermelon xigua yellow yumberry zebra.
78
+ """
+
+ print(f'Length of Memorized_Speech = {len(Memorized_Speech)} characters, as follows:')
+ print(Memorized_Speech)
+
+ # Add special tokens here. "<pad>" is also used for unknown words. The carriage-return special token is auto-inserted into the received text before tokenization. Tabs and newlines are not implemented/supported.
+ # Hyperparameters
+ hyperparameters = {
+     "vocab_size": 152,  # Estimated vocabulary size for Gettysburg Address + special tokens (overwritten in main() with tokenizer.vocab_size)
+     "special_tokens": ["<FreetheLLM>", "<cr>", "<pad>"],
+     "n_embd": 4,  # Embedding dimension
+     "n_layer": 1,  # Number of layers
+     "n_head": 1,  # Number of attention heads
+     "n_inner": 4 * 16,  # Inner dimension of feedforward network (64 here, i.e. 16 times n_embd; the usual convention is 4 times n_embd)
+     "max_sequence_len": 340,  # Maximum sequence length
+     "epochs": 100000,  # Number of training epochs
+     "learning_rate": 1e-3,  # [Initial] Learning rate
+     "batch_size": 16,  # Batch size (since the dataset is small)
+     "dropout": 0.2  # Dropout probability (note: the layers below hard-code nn.Dropout(0.1) and do not read this value)
+ }
+ # More Script/Training parameters:
+ min_training_input_seq_len = 300
+ Early_stopping_loss = 0.001
+
+
+ def print_with_line(message):
+     # Print the message together with the line number of the caller
+     frame = inspect.currentframe().f_back  # needs import inspect
+     line_number = frame.f_lineno
+     print(f"{message} at script line {line_number}")
+
+ # --- Tokenizer and Detokenizer ---
+ class Tokenizer:
+     def __init__(self, text, special_tokens, vocab_size_hyperparameter):
+         self.special_tokens = special_tokens
+         self.cr_token = special_tokens[1]
+         #self.punctuation = string.punctuation  # Store punctuation characters
+         self.punctuation_list = ['.', ',', '/', '\\', '[', ']', '<', '?', '>', '-']  # Specific list of punctuation marks
+         estimated_vocab_size = vocab_size_hyperparameter  # hyperparameters["vocab_size"]
+
+         # Preprocess text to separate existing punctuation from words, then auto-insert <cr> special tokens at carriage returns.
+         text = self.separate_punctuation(text)
+
+         in_text_words = []
+         in_text_punctuations = []
+         for candidate in text.split():  # Split into space-separated candidates (words and punctuation; includes words attached to punctuation)
+             cleaned_words = ''.join(c for c in candidate if c not in self.punctuation_list)  # Strip punctuation from words
+             if cleaned_words:
+                 in_text_words.append(cleaned_words.lower())
+             for char in candidate:  # Iterate through each character in the candidate
+                 if char in self.punctuation_list:
+                     in_text_punctuations.append(char)  # Add in-text punctuation as separate tokens
+
+         # Ensure unique and sorted word and punctuation tokens
+         in_text_words = list(set(in_text_words))
+         in_text_words.sort()
+         in_text_punctuations = list(set(in_text_punctuations))
+         in_text_punctuations.sort()
+
+         self.vocab = self.special_tokens + in_text_punctuations + in_text_words  # Vocab starts with special tokens, then punctuation, then whole words.
+         self.vocab_size = len(self.vocab)  # Calculate vocabulary size dynamically
+         # Alert if vocab_size is different from the predefined hyperparameter estimate (optional)
+         if self.vocab_size != estimated_vocab_size:
+             print(f"Warning: Calculated vocab_size ({self.vocab_size}) differs from estimated size ({estimated_vocab_size}).")
+
+         self.word_to_index = {word: i for i, word in enumerate(self.vocab)}
+         self.index_to_word = {i: word for i, word in enumerate(self.vocab)}
+
+     def separate_punctuation(self, text):
+         # Text passed to the tokenize method is also preprocessed here before tokenization.
+         # The punctuation loop itself does not affect carriage returns (\r) in the original text.
+         # Add spaces around punctuation to separate it from words.
+         for char in self.punctuation_list:
+             text = text.replace(char, f' {char} ')
+         # Replace carriage returns (backslash-r) in the input text with a special token (e.g., <cr>).
+         text = text.replace('\r', f' {self.cr_token} ')  # Replace \r with <cr> token and pad with spaces.
+         #print(f"Carriage-Return's special token inserted as {self.cr_token}")
+         return text
+
+
+     def tokenize(self, text):
+         # Apply punctuation separation before tokenizing
+         text = self.separate_punctuation(text)
+         words = text.lower().split()  # preserves special tokens like the auto-inserted <cr>
+         token_ids = []
+         for word in words:
+             if word in self.word_to_index:
+                 token_ids.append(self.word_to_index[word])
+             else:
+                 #token_ids.append(self.word_to_index['<pad>'])
+                 # Unknown word: use the last special token (assumed to be <pad>) as the default.
+                 token_ids.append(self.word_to_index[self.special_tokens[-1]])
+         return token_ids
+
+     def detokenize(self, tokens):
+         return " ".join([self.index_to_word[token] for token in tokens if token in self.index_to_word])
+
+ # --- GPT-2 Model ---
+ class CausalSelfAttention(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         assert config["n_embd"] % config["n_head"] == 0
+         # key, query, value projections for all heads, but in a batch
+         self.c_attn = nn.Linear(config["n_embd"], 3 * config["n_embd"])
+         # output projection
+         self.c_proj = nn.Linear(config["n_embd"], config["n_embd"])
+         # regularization
+         self.attn_dropout = nn.Dropout(0.1)
+         self.resid_dropout = nn.Dropout(0.1)
+         self.n_head = config["n_head"]
+         self.n_embd = config["n_embd"]
+         self.register_buffer("bias", torch.tril(torch.ones(config["max_sequence_len"], config["max_sequence_len"]))
+                              .view(1, 1, config["max_sequence_len"], config["max_sequence_len"]))
+
+     def forward(self, x):
+         B, T, C = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)
+
+         # calculate query, key, values for all heads in batch and move head forward to be the batch dim
+         q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
+         k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
+         q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
+         v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
+
+         # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
+         att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
+         att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
+         att = torch.softmax(att, dim=-1)
+         att = self.attn_dropout(att)
+         y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
+         y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side
+
+         # output projection
+         y = self.resid_dropout(self.c_proj(y))
+         return y
+
+ class Block(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.ln_1 = nn.LayerNorm(config["n_embd"])
+         self.attn = CausalSelfAttention(config)
+         self.ln_2 = nn.LayerNorm(config["n_embd"])
+         self.mlp = nn.Sequential(
+             nn.Linear(config["n_embd"], config["n_inner"]),
+             nn.GELU(),
+             nn.Linear(config["n_inner"], config["n_embd"]),
+             nn.Dropout(0.1),
+         )
+
+     def forward(self, x):
+         x = x + self.attn(self.ln_1(x))
+         x = x + self.mlp(self.ln_2(x))
+         return x
+
+ class ToyGPT2(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.token_embedding_table = nn.Embedding(config["vocab_size"], config["n_embd"])
+         self.position_embedding_table = nn.Embedding(config["max_sequence_len"], config["n_embd"])
+         self.blocks = nn.Sequential(*[Block(config) for _ in range(config["n_layer"])])
+         self.ln_f = nn.LayerNorm(config["n_embd"])  # final layer norm
+         self.lm_head = nn.Linear(config["n_embd"], config["vocab_size"])
+
+         # Initialize weights to be small for better training
+         self.apply(self._init_weights)
+
+         # Tie the weights of the embedding and the output layer
+         self.lm_head.weight = self.token_embedding_table.weight
+
+     def _init_weights(self, module):
+         #if isinstance(module, nn.Linear):
+         #    torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+         if isinstance(module, nn.Linear) and module.bias is not None:
+             #print("isinstance(module, nn.Linear) and module.bias is not None")
+             torch.nn.init.zeros_(module.bias)
+         elif isinstance(module, nn.Embedding):
+             torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+             #print("torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)")
+
+     def forward(self, idx, targets=None):
+         B, T = idx.shape
+         # idx and targets are both (B,T) tensors of integers
+         tok_emb = self.token_embedding_table(idx)  # (B,T,C)
+         pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T,C)
+         x = tok_emb + pos_emb  # (B,T,C)
+         x = self.blocks(x)  # (B,T,C)
+         x = self.ln_f(x)  # (B,T,C)
+         logits = self.lm_head(x)  # (B,T,vocab_size)
+
+         if targets is None:
+             loss = None
+         else:
+             B, T, C = logits.shape
+             logits = logits.view(B*T, C)
+             targets = targets.view(B*T)
+             loss = nn.functional.cross_entropy(logits, targets)
+
+         return logits, loss
+
+     def generate(self, input_ids, max_new_tokens, temperature=1.0):
+         self.eval()  # Set model to evaluation mode
+         with torch.no_grad():  # Disable gradient calculation during generation
+             for _ in range(max_new_tokens):
+                 # Limit input_ids to the last max_sequence_len tokens
+                 input_ids_truncated = input_ids[:, -self.config["max_sequence_len"]:]
+
+                 # Get logits from the model
+                 logits, _ = self(input_ids_truncated)  # No need for loss during generation
+
+                 # Focus on the logits for the last time step (next token prediction)
+                 logits = logits[:, -1, :] / temperature
+
+                 # Apply softmax to get probabilities
+                 probs = torch.softmax(logits, dim=-1)
+
+                 # Sample the next token
+                 next_token = torch.multinomial(probs, num_samples=1)
+
+
+                 # Append next token to input sequence
+                 input_ids = torch.cat((input_ids, next_token), dim=1)
+
+         self.train()  # Return model to training mode
+         return input_ids
+
+ # --- Dataset ---
+ class Dataset(Dataset):  # Note: this name shadows the torch.utils.data.Dataset base class imported above
+     def __init__(self, data, tokenizer, seq_len):
+         self.tokenizer = tokenizer
+         self.seq_len = seq_len
+
+         print_with_line("# Tokenize the entire data")
+         self.tokens = self.tokenizer.tokenize(data)
+         print(f"DEBUG: Total tokens: {len(self.tokens)} in Dataset(")
+
+         # Calculate token counts
+         self.token_counts = self._calculate_token_counts()  # Store counts in the object
+
+         # Create input-target pairs (non-overlapping windows; the target is the input shifted by one token)
+         self.data = []
+         for i in range(0, len(self.tokens) - seq_len - 1, seq_len):
+             input_seq = self.tokens[i:i + seq_len]
+             target_seq = self.tokens[i + 1:i + seq_len + 1]
+             self.data.append((torch.tensor(input_seq), torch.tensor(target_seq)))
+
+         print(f"DEBUG Dataset(Dataset): Number of data samples created in class Dataset(Dataset): {len(self.data)}")
+
+         # Print token-vocabulary information
+         print_with_line("# Print token-vocabulary information:")
+         self.print_vocabulary_info()
+
+     def _calculate_token_counts(self):
+         # Calculates the frequency of each token in self.tokens.
+         counts = {}
+         for token in self.tokens:
+             if token in counts:
+                 counts[token] += 1
+                 #print(f"token {token} count has been incremented to {counts[token]}")
+             else:
+                 counts[token] = 1
+         return counts
+
+     def print_vocabulary_info(self):
+         print_with_line("# Print token-vocabulary information:")
+         for token_id in range(self.tokenizer.vocab_size):  # Iterate through indices
+             token = self.tokenizer.index_to_word[token_id]  # Get token string from index
+             count = self.token_counts.get(token_id, 0)  # token_id is an integer ID; default to 0 if not found
+             #print(f" Token {token_id}: '{token}' occurs {count} times in the dataset")
+             print(f" Token {token_id}:'{token}' \t\t occurs {count} times in the dataset")
+
+     def __len__(self):
+         return len(self.data)
+
+     def __getitem__(self, idx):
+         return self.data[idx]  # Return the pre-processed tensor pairs
+
+
+
+ # --- Trainer ---
+ class Trainer:
+     def __init__(self, model, tokenizer, train_loader, hyperparameters, device):
+         self.model = model
+         self.tokenizer = tokenizer
+         self.train_loader = train_loader
+         self.hyperparameters = hyperparameters
+         self.Early_stopping_loss = Early_stopping_loss  # Early-stopping threshold (module-level setting)
+         self.device = device  # Store the device
+
+         self.optimizer = optim.AdamW(self.model.parameters(), lr=hyperparameters["learning_rate"])
+         self.scheduler = ReduceLROnPlateau(self.optimizer, mode='min', factor=0.99, patience=100)
+         # mode='min': indicates that the monitored quantity (the loss) should be minimized.
+         # factor=0.99: the factor by which the learning rate is multiplied on each reduction (0.99 means a 1% cut).
+         # patience=100: number of epochs with no improvement after which the learning rate is reduced.
+         # verbose is not passed here; the training loop prints a message itself when the learning rate changes.
+         # Step the scheduler: self.scheduler.step(average_loss) is called after average_loss is calculated, so the scheduler updates the learning rate based on the current loss.
+         # Automated adjustment: the scheduler adjusts the learning rate automatically, removing the need for manual tuning during training.
+         # Improved convergence: can help the model converge more smoothly and potentially reach a better solution.
+         # Reduced fluctuations: helps reduce the fluctuations in the loss.
+
+     def train(self):
+         self.model.train()  # Set model to training mode
+         for epoch in range(self.hyperparameters["epochs"]):
+             last_lr = self.optimizer.param_groups[0]['lr']  # Get the learning rate from the optimizer
+             total_loss = 0
+             for batch_idx, (input_seq, target_seq) in enumerate(self.train_loader):  # enumerate gives the batch index
+                 input_seq = input_seq.to(self.device)  # Move to device
+                 target_seq = target_seq.to(self.device)  # Move to device
+                 self.optimizer.zero_grad()
+                 logits, loss = self.model(input_seq, targets=target_seq)  # logits are the raw predictions
+                 loss.backward()
+                 self.optimizer.step()
+                 total_loss += loss.item()
+             average_loss = total_loss / len(self.train_loader)  # Average over the number of batches
+             print(f"Epoch {epoch+1}/{self.hyperparameters['epochs']}, Loss: {average_loss:.4f}")
+             if loss < 0.01:  # Checks the loss of the last batch of the epoch
+                 print(" LOSS IS BELOW 0.01")
+             if loss < 0.001:  # Checks the loss of the last batch of the epoch
+                 print(" LOSS IS BELOW 0.001")
+             self.scheduler.step(average_loss)  # Update the learning-rate scheduler with the current loss
+             # Check if the learning rate has changed and print it
+             current_lr = self.optimizer.param_groups[0]['lr']
+             #last_lr = self.scheduler.get_last_lr()[0]  # Get the last learning rate
+             if current_lr != last_lr:
+                 print(f" Learning rate reduced to {current_lr:.6f}")
+             print(f"Epoch {epoch+1}/{self.hyperparameters['epochs']}, Loss: {average_loss:.4f}, Learning Rate: {current_lr:.6f}")
+             if epoch % 100 == 0:
+                 # self.optimizer.param_groups[0]['lr'] is the standard way to read the learning rate of the first (and often only) parameter group of a PyTorch optimizer.
+                 current_lr = self.optimizer.param_groups[0]['lr']
+                 print(f"Epoch {epoch + 1}: Current learning rate: {current_lr:.6f}")
+                 #self.save_checkpoint(f"model_checkpoint_epoch_{epoch+1}.pth")
+                 #self.save_checkpoint(f"model_checkpoint_epoch_{epoch + 1}.pth", epoch, average_loss)  # Pass epoch and average_loss
+             # Early stopping condition
+             if average_loss < self.Early_stopping_loss:
+                 print(f"Early stopping: Average loss {average_loss:.4f} is below the threshold ({self.Early_stopping_loss}).")
+                 self.save_checkpoint("model_checkpoint_early_stop.pth", epoch, average_loss)  # Save checkpoint
+                 break  # Exit the training loop
+
+     def save_checkpoint(self, path, epoch, average_loss):
+         # Get the current script's filename
+         script_filename = os.path.basename(__file__)  # Get filename from the current script path
+
+         # Get the current date and time
+         current_datetime = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
+
+         # Construct the new filename
+         base_filename, extension = os.path.splitext(path)  # Split original filename
+         new_filename = f"{base_filename}_{script_filename}_{current_datetime}{extension}"
+
+         torch.save({
+             'epoch': epoch,
+             'model_state_dict': self.model.state_dict(),
+             'optimizer_state_dict': self.optimizer.state_dict(),
+             'loss': average_loss,
+             'hyperparameters': self.hyperparameters
+         }, new_filename)
+
+
+ # --- Main Execution ---
+ def main():
+     # Determine device (GPU if available, else CPU)
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     print(f"Using device: {device}")
+
+     print_with_line("# Initialize tokenizer")
+     #tokenizer = Tokenizer(Memorized_Speech)
+     tokenizer = Tokenizer(Memorized_Speech, hyperparameters["special_tokens"], hyperparameters["vocab_size"])  # The special_tokens list is defined in the hyperparameters dictionary.
+     print(f"Tokenizer Vocabulary Size: {tokenizer.vocab_size}")
+
+     # Update vocab_size in hyperparameters
+     hyperparameters["vocab_size"] = tokenizer.vocab_size
+     print(f"The Hyperparameter vocab_size (Vocabulary Size) is set to: {tokenizer.vocab_size}")  # Print calculated size
+
+     print_with_line("# Prepare dataset")
+     #dataset = Dataset(Memorized_Speech, tokenizer, hyperparameters["max_sequence_len"])
+     dataset = Dataset(Memorized_Speech, tokenizer, min_training_input_seq_len)  # Common values of min_training_input_seq_len for smaller models or experiments are 32, 64, 128, or 256.
+     train_loader = DataLoader(dataset, batch_size=hyperparameters["batch_size"])
+
+     print_with_line("# Initialize model")
+     print(f"HyperParameters = {hyperparameters}")
+     model = ToyGPT2(hyperparameters).to(device)
+
+     print_with_line("# Initialize trainer")
+     trainer = Trainer(model, tokenizer, train_loader, hyperparameters, device)
+
+     print_with_line("# Train the model")
+     trainer.train()
+
+     print("")  # space
+     print_with_line("# --- Inference Examples ---")
+     model.eval()
+
+     # Example 1: Recite the Gettysburg Address
+     print_with_line("# Example 1: Recite the Gettysburg Address")
+     start_text = "four score"
+     start_tokens = torch.tensor(tokenizer.tokenize(start_text)).unsqueeze(0).to(device)  # shape (1, T)
+     print("Prompt:", start_text)
+     # Generate a completion for the whole dataset. start_tokens.shape[1] is the prompt length in tokens (len(start_tokens) would be the batch size, 1).
+     generated_tokens = model.generate(start_tokens, max_new_tokens=len(dataset.tokens) - start_tokens.shape[1], temperature=1.0)
+     generated_text = tokenizer.detokenize(generated_tokens.squeeze().tolist())
+     print("\nResponse:\n", generated_text)
+
+     print("")  # space
+     # Example 2: Free text generation after encountering <FreetheLLM>  #### Eventually, modify to request user text including only Gettysburg vocabulary
+     print_with_line("# Example 2: Free text generation after encountering <FreetheLLM>")
+
+     start_text = "we here highly resolve that these dead shall not have died in vain and that this nation under god shall have a new "
+     special_token = tokenizer.special_tokens[0]  # Get the <FreetheLLM> token
+     start_text += special_token  # Append the special token directly to the string
+     # Note: as written, tokenize() splits '<' and '>' off as punctuation and lowercases the text, so the appended token reaches the model as <pad> tokens rather than as <FreetheLLM>.
+     print("Prompt:", start_text)
+
+     start_tokens = torch.tensor(tokenizer.tokenize(start_text)).unsqueeze(0).to(device)  # Tokenize the combined string
+
+     generated_tokens = model.generate(start_tokens, max_new_tokens=100, temperature=1.0)
+     generated_text = tokenizer.detokenize(generated_tokens.squeeze().tolist())
+     print("\nFreestyle Generation:\n", generated_text)
+
+     print(f"HyperParameters = {hyperparameters}")
+
+ if __name__ == "__main__":
+     main()