Update README.md
README.md
@@ -45,7 +45,7 @@ p("In a shocking finding, scientist discovered a herd of unicorns living in a re
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-Datasets used: Fineweb-Edu 10B + OpenHermes 2.5
+Datasets used: Fineweb-Edu 10B + OpenHermes 2.5
 
 Dataset proportions:
 - Part 1: FWE 4,836,050 + OH 100,000 (2.03%) = 4,936,050
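A quick sanity check of the Part 1 mix, assuming the 2.03% refers to the OpenHermes share of that part's document count (the source doesn't state this explicitly):

```python
# Part 1 mix check: OpenHermes share of the part's documents.
fwe_docs, oh_docs = 4_836_050, 100_000
part1_total = fwe_docs + oh_docs
assert part1_total == 4_936_050                     # matches the README
print(f"OH share of Part 1: {oh_docs / part1_total:.2%}")  # 2.03%
```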
@@ -69,7 +69,7 @@ Total documents: 10,669,024
 - **Training regime:**
 - bf16
 - context length 1024
-- per device batch size 16, global batch size
+- per device batch size 16, global batch size 524,288 -> gradient accumulation 16
 - zero stage 1
 - lr 3e-4, cosine schedule, 700 warmup steps
 - more details see [run script](run_gpt2_350M_edu_hermes.sh)
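The completed batch-size line is internally consistent: with the 2x RTX4090 from the summary below and a global batch counted in tokens, 16 seqs/GPU x 2 GPUs x 1024 tokens x 16 accumulation steps = 524,288. A minimal sketch of that arithmetic plus the stated warmup-then-cosine schedule; the per-token batch accounting and the decay-to-zero floor are assumptions, not confirmed by the run script:

```python
import math

# Batch accounting (assumes the global batch is measured in tokens
# and the 2x RTX4090 from the summary section below).
per_device_bs = 16      # sequences per GPU per micro-step
n_gpus = 2
ctx_len = 1024          # tokens per sequence
grad_accum = 16

global_batch = per_device_bs * n_gpus * ctx_len * grad_accum
assert global_batch == 524_288                  # matches the diff

total_steps = 10_287_579_136 // global_batch    # exactly 19,622 optimizer steps
warmup_steps = 700
max_lr = 3e-4

def lr_at(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay to 0 (assumed floor)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```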
@@ -78,9 +78,9 @@ Total documents: 10,669,024
 
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
-Params: 355M ->
-Tokens: ~10B
-Total training time: 30hrs
+Params: 355M -> 710MB / checkpoint
+Tokens: ~10B (10,287,579,136)
+Total training time: ~30hrs
 Hardware: 2x RTX4090
 MFU: 71% (110,000 tok/s)
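The summary numbers cross-check against each other. A back-of-the-envelope sketch, assuming weights-only bf16 checkpoints at 2 bytes/param, ~165 TFLOPS dense bf16 peak per RTX 4090, and the usual 6*N FLOPs-per-token estimate:

```python
params = 355e6
tokens = 10_287_579_136
tok_per_s = 110_000                    # from the MFU line

# Checkpoint size: bf16 stores 2 bytes per parameter.
print(f"checkpoint: ~{params * 2 / 1e6:.0f} MB")            # ~710 MB

# Pure compute time at the stated throughput.
print(f"compute: ~{tokens / tok_per_s / 3600:.0f} h")       # ~26 h; ~30 h wall clock
                                                            # is plausible with overhead

# MFU = achieved FLOP/s over peak FLOP/s.
peak_flops = 2 * 165.2e12              # 2x RTX 4090, dense bf16 (assumed peak)
mfu = 6 * params * tok_per_s / peak_flops
print(f"MFU: ~{mfu:.0%}")                                   # ~71%, matching the README
```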