MartialTerran committed on
Commit 63d3753 · verified · 1 Parent(s): 2160f60

Update: with one layer (n_layer 1), n_embd 4 is a failure, but n_embd 6 is a marginal success.

@@ -4,6 +4,9 @@ Upon adding a second layer, (n_embd': 4, 'n_layer': 2) [Epoch 53525/100000, Loss
 
 Four floats of embeddings is apparently sufficient to support some sequencing, but not quite enough information to sequence so many distinct and repeated words and punctuation marks. (Microsoft researchers recently found that, in other LLMs, entire attention heads were focused on "punctuation".)
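As a rough, hedged illustration of what "four floats of embeddings" means (a sketch with an assumed vocabulary size, not code from this repo), each token of the Gettysburg Address is looked up in an embedding table whose rows are only n_embd numbers wide:

```python
import torch.nn as nn

# Sketch only: vocab_size is an assumption (roughly the distinct words and
# punctuation tokens in the Gettysburg Address), not a value from this repo.
vocab_size = 150

tok_emb_4 = nn.Embedding(vocab_size, 4)  # n_embd = 4: four floats per token (the failing width)
tok_emb_6 = nn.Embedding(vocab_size, 6)  # n_embd = 6: six floats per token (the marginal success)

print(tok_emb_4.weight.shape)  # torch.Size([150, 4])
print(tok_emb_6.weight.shape)  # torch.Size([150, 6])
```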
 
+ See https://medium.com/@thethoughtpalette/are-tiny-transformers-the-future-of-scaling-626594655c48
+ Quote: "4. Overfitting: Due to their small size, tiny transformers are prone to overfitting on limited datasets. This leads to reduced generalizability, making them less effective when faced with new or varied data inputs."
+
 At 'n_embd': 6, 'n_layer': 1, 'n_head': 1, 'n_inner': 64, the Toy Gettysburg GPT-2 model got off to a good start with "four score and seven years ago our fathers brought forth on this continent , a new nation , conceived in" before some glitches, but then resumed with another whole part of the Gettysburg speech: "that all men are created equal . now we are engaged in a great civil war , testing whether that nation , or any nation so conceived and so dedicated , can long endure . we are met on a great battle - field of that war . we have come to dedicate a portion of that field , as a final resting place for those who here gave their lives that that nation might endure "
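For reference, a minimal sketch of how this configuration could be written with Hugging Face's GPT2Config (the repo's actual training script is not shown here; vocab_size and n_positions are assumed values):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Sketch of the 6-float, single-layer configuration described above.
# vocab_size and n_positions are assumptions, not values from this repo.
config = GPT2Config(
    vocab_size=150,   # assumed word-level vocabulary for the speech
    n_positions=64,   # assumed maximum context length
    n_embd=6,         # six floats per token embedding
    n_layer=1,        # one transformer block
    n_head=1,         # one attention head
    n_inner=64,       # feed-forward (MLP) width
)
model = GPT2LMHeadModel(config)
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```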
 
 Adding a second layer to the 6-float model ('n_embd': 6, 'n_layer': 2, 'n_head': 1, 'n_inner': 64) (and no other modifications) did solve the glitch, after almost 60,000 epochs (and a carefully timed, gradually receding learning rate):
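The exact learning-rate schedule is not given above; one plausible way to get a gradually receding rate over a run of this length is a plateau-based decay, sketched here with assumed optimizer settings and dummy data (not the repo's training loop):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Sketch only: the optimizer, schedule, and all numbers below are assumptions,
# not the settings actually used for the ~60,000-epoch run described above.
config = GPT2Config(vocab_size=150, n_positions=64,
                    n_embd=6, n_layer=2, n_head=1, n_inner=64)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2000, min_lr=1e-5
)

tokens = torch.randint(0, 150, (1, 32))  # placeholder token ids, not the real dataset
for epoch in range(60_000):
    out = model(tokens, labels=tokens)   # GPT2LMHeadModel returns a loss when labels are given
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step(out.loss.item())      # lower the learning rate when the loss stops improving
```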