Update README.md
README.md
@@ -29,7 +29,7 @@ Inspired by the TinyStories research, which explores the effectiveness of small
For detailed training procedures and configurations, please refer to [this GitHub repository](https://github.com/jia-zhuang/chinese-llama2.c).
- **Hardware:** Trained on an NVIDIA RTX 2080 Super with 8 GB of VRAM (a modest gaming rig).
- **Duration:** 87 hours (just over 3.5 days), covering 20k iterations and processing 2G tokens.
- - **Optimizer:** AdamW, with a learning rate (lr) of 5e-4,
+ - **Optimizer:** AdamW, with a learning rate (lr) of 5e-4, 1000 warm-up iterations, and gradient clipping at 1.0.
- **Dropout:** none.
- **Batch Size:** 4, configured to fit within the 8 GB of VRAM of the 2080; gradient accumulation steps set to 128, achieving an effective 524,288 tokens per iteration as suggested by the Chinchilla paper ([Chinchilla study](https://arxiv.org/abs/2203.15556)).
- **Training Iterations:** 20k, including a warm-up phase of 1k steps.
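
As a worked check on the numbers above: tokens per iteration = batch size × gradient-accumulation steps × sequence length = 4 × 128 × 1,024 = 524,288, which implies a sequence length of 1,024 (not stated explicitly in the list). The sketch below shows, in plain PyTorch, what one training step with these settings could look like: AdamW at lr 5e-4, 1,000 warm-up iterations, gradient clipping at 1.0, and gradient accumulation over 128 micro-batches of 4. It is a minimal illustration rather than the repository's actual train.py; the decay shape after warm-up (cosine) and all placeholder names (`model`, `get_lr`) are assumptions.

```python
import math
import torch

# Values taken from the README above.
learning_rate = 5e-4      # AdamW peak learning rate
warmup_iters = 1000       # warm-up phase
grad_clip = 1.0           # gradient clipping threshold
batch_size = 4            # micro-batch that fits in 8 GB of VRAM
grad_accum_steps = 128    # gradient accumulation steps
max_iters = 20_000        # total training iterations
max_seq_len = 1024        # assumed: implied by 4 * 128 * 1024 = 524,288 tokens/iter

# Effective tokens processed per optimizer step.
tokens_per_iter = batch_size * grad_accum_steps * max_seq_len
assert tokens_per_iter == 524_288

def get_lr(it: int) -> float:
    """Linear warm-up to the peak LR, then cosine decay to zero.
    The decay shape is an assumption; the README only states the peak LR
    and the 1k-step warm-up."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, max_iters - warmup_iters)
    return 0.5 * learning_rate * (1.0 + math.cos(math.pi * progress))

# `model` is a placeholder for the Transformer being trained.
model = torch.nn.Linear(8, 8)  # stand-in so the sketch runs end to end
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for it in range(max_iters):
    # Set the learning rate for this iteration.
    lr = get_lr(it)
    for group in optimizer.param_groups:
        group["lr"] = lr

    # Accumulate gradients over 128 micro-batches of size 4.
    for _ in range(grad_accum_steps):
        x = torch.randn(batch_size, 8)          # placeholder batch
        loss = model(x).pow(2).mean()           # placeholder loss
        (loss / grad_accum_steps).backward()    # scale so gradients average correctly

    # Clip gradients at 1.0, then take one optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    break  # remove to run all 20k iterations; one pass is enough for the sketch
```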