Update README.md
README.md
@@ -45,7 +45,7 @@ p("In a shocking finding, scientist discovered a herd of unicorns living in a re
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-Datasets used: Fineweb-Edu 10B + OpenHermes 2.5
+Datasets used: Fineweb-Edu 10B + OpenHermes 2.5
 
 Dataset proportions:
 - Part 1: FWE 4,836,050 + OH 100,000 (2.03%) = 4,936,050
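A quick sanity check of the Part 1 mix, assuming the 2.03% refers to the OpenHermes share of that part's document count (the source doesn't state this explicitly):

```python
# Part 1 mix check: OpenHermes share of the part's documents.
fwe_docs, oh_docs = 4_836_050, 100_000
part1_total = fwe_docs + oh_docs
assert part1_total == 4_936_050                     # matches the README
print(f"OH share of Part 1: {oh_docs / part1_total:.2%}")  # 2.03%
```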
@@ -69,7 +69,7 @@ Total documents: 10,669,024
 - **Training regime:**
 - bf16
 - context length 1024
-- per device batch size 16, global batch size
+- per device batch size 16, global batch size 524,288 -> gradient accumulation 16
 - zero stage 1
 - lr 3e-4, cosine schedule, 700 warmup steps
 - more details see [run script](run_gpt2_350M_edu_hermes.sh)
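The completed batch-size line is internally consistent: with the 2x RTX4090 from the summary below and a global batch counted in tokens, 16 seqs/GPU x 2 GPUs x 1024 tokens x 16 accumulation steps = 524,288. A minimal sketch of that arithmetic plus the stated warmup-then-cosine schedule; the per-token batch accounting and the decay-to-zero floor are assumptions, not confirmed by the run script:

```python
import math

# Batch accounting (assumes the global batch is measured in tokens
# and the 2x RTX4090 from the summary section below).
per_device_bs = 16      # sequences per GPU per micro-step
n_gpus = 2
ctx_len = 1024          # tokens per sequence
grad_accum = 16

global_batch = per_device_bs * n_gpus * ctx_len * grad_accum
assert global_batch == 524_288                  # matches the diff

total_steps = 10_287_579_136 // global_batch    # exactly 19,622 optimizer steps
warmup_steps = 700
max_lr = 3e-4

def lr_at(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay to 0 (assumed floor)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```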
@@ -78,9 +78,9 @@ Total documents: 10,669,024
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
-Params: 355M ->
-Tokens: ~10B
-Total training time: 30hrs
+Params: 355M -> 710MB / checkpoint
+Tokens: ~10B (10,287,579,136)
+Total training time: ~30hrs
 Hardware: 2x RTX4090
 MFU: 71% (110,000 tok/s)
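The summary numbers cross-check against each other. A back-of-the-envelope sketch, assuming weights-only bf16 checkpoints at 2 bytes/param, ~165 TFLOPS dense bf16 peak per RTX 4090, and the usual 6*N FLOPs-per-token estimate:

```python
params = 355e6
tokens = 10_287_579_136
tok_per_s = 110_000                    # from the MFU line

# Checkpoint size: bf16 stores 2 bytes per parameter.
print(f"checkpoint: ~{params * 2 / 1e6:.0f} MB")            # ~710 MB

# Pure compute time at the stated throughput.
print(f"compute: ~{tokens / tok_per_s / 3600:.0f} h")       # ~26 h; ~30 h wall clock
                                                            # is plausible with overhead

# MFU = achieved FLOP/s over peak FLOP/s.
peak_flops = 2 * 165.2e12              # 2x RTX 4090, dense bf16 (assumed peak)
mfu = 6 * params * tok_per_s / peak_flops
print(f"MFU: ~{mfu:.0%}")                                   # ~71%, matching the README
```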