MartialTerran committed on
Commit 63d3753 · verified · 1 Parent(s): 2160f60

Update: with one layer (n_layer 1), n_embd 4 is a failure, but n_embd 6 is a marginal success.

@@ -4,6 +4,9 @@ Upon adding a second layer, (n_embd': 4, 'n_layer': 2) [Epoch 53525/100000, Loss
 
 Four floats of embeddings is apparently sufficient to support some sequencing, but not quite enough information to sequence so many distinct and repeated words and punctuation marks. (Microsoft researchers recently found that, in other LLMs, entire attention heads were focused on "punctuation".)
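As a rough, hedged illustration of what "four floats of embeddings" means (a sketch with an assumed vocabulary size, not code from this repo), each token of the Gettysburg Address is looked up in an embedding table whose rows are only n_embd numbers wide:

```python
import torch.nn as nn

# Sketch only: vocab_size is an assumption (roughly the distinct words and
# punctuation tokens in the Gettysburg Address), not a value from this repo.
vocab_size = 150

tok_emb_4 = nn.Embedding(vocab_size, 4)  # n_embd = 4: four floats per token (the failing width)
tok_emb_6 = nn.Embedding(vocab_size, 6)  # n_embd = 6: six floats per token (the marginal success)

print(tok_emb_4.weight.shape)  # torch.Size([150, 4])
print(tok_emb_6.weight.shape)  # torch.Size([150, 6])
```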
 
+ See https://medium.com/@thethoughtpalette/are-tiny-transformers-the-future-of-scaling-626594655c48
+ Quote: "4. Overfitting: Due to their small size, tiny transformers are prone to overfitting on limited datasets. This leads to reduced generalizability, making them less effective when faced with new or varied data inputs."
+
 At 'n_embd': 6, 'n_layer': 1, 'n_head': 1, 'n_inner': 64, the Toy Gettysburg GPT-2 model got off to a good start with "four score and seven years ago our fathers brought forth on this continent , a new nation , conceived in" before some glitches, but then resumed with another whole part of the Gettysburg speech: "that all men are created equal . now we are engaged in a great civil war , testing whether that nation , or any nation so conceived and so dedicated , can long endure . we are met on a great battle - field of that war . we have come to dedicate a portion of that field , as a final resting place for those who here gave their lives that that nation might endure "
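For reference, a minimal sketch of how this configuration could be written with Hugging Face's GPT2Config (the repo's actual training script is not shown here; vocab_size and n_positions are assumed values):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Sketch of the 6-float, single-layer configuration described above.
# vocab_size and n_positions are assumptions, not values from this repo.
config = GPT2Config(
    vocab_size=150,   # assumed word-level vocabulary for the speech
    n_positions=64,   # assumed maximum context length
    n_embd=6,         # six floats per token embedding
    n_layer=1,        # one transformer block
    n_head=1,         # one attention head
    n_inner=64,       # feed-forward (MLP) width
)
model = GPT2LMHeadModel(config)
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```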
 
 Adding a second layer to the 6-float model ('n_embd': 6, 'n_layer': 2, 'n_head': 1, 'n_inner': 64) (and no other modifications) did solve the glitch, after almost 60,000 epochs (and a carefully timed, gradually receding learning rate):
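The exact learning-rate schedule is not given above; one plausible way to get a gradually receding rate over a run of this length is a plateau-based decay, sketched here with assumed optimizer settings and dummy data (not the repo's training loop):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Sketch only: the optimizer, schedule, and all numbers below are assumptions,
# not the settings actually used for the ~60,000-epoch run described above.
config = GPT2Config(vocab_size=150, n_positions=64,
                    n_embd=6, n_layer=2, n_head=1, n_inner=64)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2000, min_lr=1e-5
)

tokens = torch.randint(0, 150, (1, 32))  # placeholder token ids, not the real dataset
for epoch in range(60_000):
    out = model(tokens, labels=tokens)   # GPT2LMHeadModel returns a loss when labels are given
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step(out.loss.item())      # lower the learning rate when the loss stops improving
```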