MartialTerran committed on
Commit aa29bb9
1 Parent(s): 63d3753

Update: With one layer (n_layer 1), n_embd 4 is a failure, but n_embd 6 is a marginal success.

@@ -5,7 +5,7 @@ Upon adding a second layer, ('n_embd': 4, 'n_layer': 2) [Epoch 53525/100000, Loss
  Four floats of embeddings is apparently sufficient to support some sequencing, but not quite enough information to sequence so many different/same words and punctuation marks. (Microsoft researchers recently found that, in other LLMs, entire attention heads were focused on "punctuation".)
 
  See https://medium.com/@thethoughtpalette/are-tiny-transformers-the-future-of-scaling-626594655c48
- Quote: "4. Overfitting: Due to their small size, tiny transformers are prone to overfitting on limited datasets. This leads to reduced generalizability, making them less effective when faced with new or varied data inputs.
+ Quote: "4. Overfitting: Due to their small size, tiny transformers are prone to overfitting on limited datasets. This leads to reduced generalizability, making them less effective when faced with new or varied data inputs. ... Ongoing research continues to focus on refining the mechanisms of these models, striving to enhance their performance while minimizing their footprint."
 
  At 'n_embd': 6, 'n_layer': 1, 'n_head': 1, 'n_inner': 64, the Toy Gettysburg GPT-2 model got off to a good start with "four score and seven years ago our fathers brought forth on this continent , a new nation , conceived in" before some glitches, but then resumed with another whole part of the Gettysburg speech: "that all men are created equal . now we are engaged in a great civil war , testing whether that nation , or any nation so conceived and so dedicated , can long endure . we are met on a great battle - field of that war . we have come to dedicate a portion of that field , as a final resting place for those who here gave their lives that that nation might endure "
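For reference, below is a minimal sketch of how a GPT-2 model of the size discussed above ('n_embd': 6, 'n_layer': 1, 'n_head': 1, 'n_inner': 64) could be instantiated with the Hugging Face transformers library. This is not the repository's actual training script; the vocab_size and n_positions values are assumed placeholders for a word-level Gettysburg Address vocabulary.

# Minimal sketch, assuming Hugging Face `transformers` (the Toy Gettysburg GPT-2
# may build its model differently). vocab_size and n_positions are hypothetical
# values for a small word-level vocabulary over the Gettysburg Address.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=150,   # assumed word-level vocabulary size
    n_positions=128,  # assumed maximum context length
    n_embd=6,         # embedding width discussed above (4 failed; 6 was marginal)
    n_layer=1,        # single transformer block
    n_head=1,         # n_embd must be divisible by n_head
    n_inner=64,       # feed-forward (MLP) hidden width
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count

Under these assumptions the whole model has only on the order of a few thousand parameters, most of them in the token and position embeddings, which is part of why even memorizing one short speech is a tight fit at n_embd of 4 to 6.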