rwmasood committed on
Commit 2ac7171 · verified · 1 Parent(s): 7434793

Update README.md

Files changed (1): README.md (+2, -5)
README.md CHANGED
@@ -24,12 +24,9 @@ license: llama3.1
 * **Contact**: For questions and comments about the model, please email [contact-us](https://chaperoneai.net/contact)
 
 ## Training
-
-
 Bigger models, more data, and better hardware have consistently improved deep learning performance. Whether in NLP or computer vision, larger models have led to major breakthroughs. However, most cutting-edge models are still trained from scratch, meaning they start with randomly initialized weights. The problem? Training costs are skyrocketing.
-
-To address the escalating computational costs of training large-scale models, various approaches have been proposed. For instance, **[arXiv.2212.05055](https://doi.org/10.48550/arXiv.2212.05055)** demonstrates a method where pretrained large models are upscaled by selectively retaining dense layers called **Mixture-of-Experts (MoE)**, followed by continued pretraining. This strategy can potentially reduce the training budget by up to **50%** while maintaining performance.
-
+To address the escalating computational costs of training large-scale models, various approaches have been proposed.
+We present our results validating depth up-scaling (DUS), a method that combines depthwise scaling with continued pretraining. Unlike other LLM up-scaling approaches that rely on mixture-of-experts, DUS requires no complex modifications for efficient training and inference, making it a simple yet effective strategy for scaling high-performance LLMs from smaller models.
 In this work, we take a step toward realizing such an approach. Specifically, we extend an existing **8B**-parameter model to **10B** parameters by initializing the additional layers with pretrained weights, followed by continued pretraining on a smaller dataset across multiple epochs. Due to budget constraints, we were unable to surpass the foundational model on the **EleutherAI** evaluation benchmark. However, our approach yielded improved performance in terms of **perplexity**, demonstrating potential for cost-efficient scaling strategies in large language model development.
 
 
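For illustration, the depth up-scaling step described in the updated section can be sketched as duplicating a slice of pretrained decoder layers before continued pretraining. This is a minimal sketch under stated assumptions: the base checkpoint name (`meta-llama/Llama-3.1-8B`), the number of duplicated layers, and the split point are placeholders for illustration, not details taken from this commit.

```python
# Hedged sketch of depth up-scaling (DUS) by duplicating pretrained decoder layers.
# Assumptions, not details from this commit: the base model is a Llama-style 8B
# checkpoint with 32 decoder layers, and duplicating 8 of them gives roughly 10B
# parameters. The checkpoint name and the split point are placeholders.
import copy

import torch
from transformers import AutoModelForCausalLM

BASE = "meta-llama/Llama-3.1-8B"   # placeholder base checkpoint
N_EXTRA = 8                        # assumed number of duplicated layers

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
layers = model.model.layers        # nn.ModuleList of decoder blocks

# Duplicate a middle slice so every layer of the deeper network starts from
# pretrained weights instead of random initialization.
start = (len(layers) - N_EXTRA) // 2
extra = [copy.deepcopy(layers[i]) for i in range(start, start + N_EXTRA)]
new_layers = list(layers[: start + N_EXTRA]) + extra + list(layers[start + N_EXTRA:])

model.model.layers = torch.nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)

# Recent transformers versions index the KV cache per layer; keep the indices
# consistent after the splice.
for i, layer in enumerate(model.model.layers):
    attn = getattr(layer, "self_attn", None)
    if attn is not None and hasattr(attn, "layer_idx"):
        attn.layer_idx = i

model.save_pretrained("llama-8b-dus-10b-init")  # starting point for continued pretraining
```

The saved checkpoint would then go through continued pretraining on the target corpus; the exact number of added layers and the split used for the released 10B model are not stated in this commit.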
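The README also reports an improvement in perplexity over the base model. As context for that comparison, below is a common sliding-window perplexity estimate; the held-out text, context length, and stride are illustrative assumptions, since the commit does not name the evaluation corpus.

```python
# Hedged sketch of a sliding-window perplexity comparison between the base 8B
# model and the up-scaled checkpoint. Model names, context length, and stride
# are placeholders, not details taken from this commit.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def perplexity(model_name: str, text: str, max_len: int = 2048, stride: int = 512) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()

    ids = tok(text, return_tensors="pt").input_ids
    total_nll, scored, prev_end = 0.0, 0, 0
    for begin in range(0, ids.size(1), stride):
        end = min(begin + max_len, ids.size(1))
        window = ids[:, begin:end]
        trg_len = end - prev_end           # tokens not yet scored by a previous window
        labels = window.clone()
        if trg_len < window.size(1):
            labels[:, :-trg_len] = -100    # ignore tokens already scored
        loss = model(window, labels=labels).loss  # mean NLL over unmasked targets
        total_nll += loss.item() * trg_len        # approximate total NLL for this window
        scored += trg_len
        prev_end = end
        if end == ids.size(1):
            break
    return math.exp(total_nll / scored)


# Hypothetical usage with a placeholder held-out text and checkpoint paths:
# ppl_base = perplexity("meta-llama/Llama-3.1-8B", heldout_text)
# ppl_upscaled = perplexity("path/to/upscaled-10b-checkpoint", heldout_text)
```

A lower value from the up-scaled checkpoint on the same held-out text would correspond to the perplexity improvement the README describes.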