* **Contact**: For questions and comments about the model, please email [contact-us](https://chaperoneai.net/contact)

## Training

Bigger models, more data, and better hardware have consistently improved deep learning performance. Whether in NLP or computer vision, larger models have led to major breakthroughs. However, most cutting-edge models are still trained from scratch, meaning they start with randomly initialized weights. The problem? Training costs are skyrocketing.

To address the escalating computational costs of training large-scale models, various approaches have been proposed.

We present our results validating depth up-scaling (DUS), a method that combines depthwise scaling with continued pretraining. Unlike other LLM up-scaling approaches that rely on mixture-of-experts, DUS requires no complex modifications for efficient training and inference, making it a simple yet effective strategy for scaling high-performance LLMs from smaller models.
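
The sketch below illustrates the depthwise-scaling step described above: a block of pretrained decoder layers is duplicated so that the added depth starts from pretrained weights rather than random initialization. It assumes a LLaMA-style model loaded through Hugging Face `transformers`; the checkpoint name, the number of duplicated layers, and the insertion point are illustrative placeholders, not the exact recipe used for this model.

```python
# Hedged sketch of depthwise up-scaling: grow a LLaMA-style model by
# duplicating a block of its pretrained decoder layers.
# Checkpoint name, layer counts, and output path are illustrative.
import copy

import torch.nn as nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
layers = base.model.layers            # nn.ModuleList of decoder blocks

n_extra = 8                           # number of layers to add (assumed)
mid = len(layers) // 2

# Copy a contiguous block of middle layers and splice the copies in right
# after the originals, so every new layer carries pretrained weights.
duplicated = [copy.deepcopy(layers[i]) for i in range(mid - n_extra, mid)]
base.model.layers = nn.ModuleList(list(layers[:mid]) + duplicated + list(layers[mid:]))
base.config.num_hidden_layers = len(base.model.layers)

# Depending on the transformers version, per-layer indices used by the KV
# cache (e.g. self_attn.layer_idx) may need renumbering before inference.
base.save_pretrained("llama-8b-to-10b-dus-init")   # hypothetical output path
```

Other DUS variants instead concatenate two overlapping copies of the base model and drop the overlapping layers; either way, the deeper network is initialized entirely from pretrained weights and then continued-pretrained.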

In this work, we take a step toward realizing such an approach. Specifically, we extend an existing **8B**-parameter model to **10B** parameters by initializing the additional layers with pretrained weights, followed by continued pretraining on a smaller dataset over multiple epochs. Due to budget constraints, we were unable to surpass the foundation model on the **EleutherAI** evaluation benchmark. However, our approach yielded improved **perplexity**, demonstrating the potential of cost-efficient scaling strategies for large language model development.
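
As a rough sanity check on the perplexity comparison mentioned above, a held-out text set can be scored with the standard token-level cross-entropy, as in the sketch below. The checkpoint identifier, the documents, and the context length are placeholders, not the exact evaluation setup behind the reported results.

```python
# Hedged sketch: perplexity of a causal LM on held-out text via transformers.
# Model id, documents, and max_len are placeholders.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-10b-checkpoint"    # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

texts = ["..."]                              # held-out documents (placeholder)
max_len = 4096                               # evaluation context length (assumed)

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        ids = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_len).input_ids.to(model.device)
        # With labels == input_ids, the model returns the mean cross-entropy
        # over the shifted (predicted) tokens.
        loss = model(ids, labels=ids).loss
        n = ids.size(1) - 1                  # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))
```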