rwmasood committed on
Commit 9e4bc24 · verified · 1 Parent(s): 7b682b4

Update README.md

Files changed (1)
  1. README.md +12 -6
README.md CHANGED
@@ -25,6 +25,14 @@ base_model:
## Training


+ Bigger models, more data, and better hardware have consistently improved deep learning performance. Whether in NLP or computer vision, larger models have led to major breakthroughs. However, most cutting-edge models are still trained from scratch, meaning they start with randomly initialized weights. The problem? Training costs are skyrocketing.
+
+ ### **Approach**
+
+ To address the escalating computational costs of training large-scale models, various approaches have been proposed. For instance, **[arXiv.2212.05055](https://doi.org/10.48550/arXiv.2212.05055)** demonstrates a method where pretrained large models are upscaled by selectively retaining dense layers, followed by continued pretraining. This strategy can potentially reduce the training budget by up to **50%** while maintaining performance.
+
+ In this work, we take a step toward realizing such an approach. Specifically, we extend an existing **8B**-parameter model to **10B** parameters by initializing the additional layers with pretrained weights, followed by continued pretraining on a smaller dataset across multiple epochs. Due to budget constraints, we were unable to surpass the base model on the **EleutherAI** evaluation benchmark; however, our approach yielded improved **perplexity**, demonstrating the potential of cost-efficient scaling strategies in large language model development.
+

## Usage

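Editor's note, not part of the commit: the up-scaling and initialization step described in the added Training text is not included in this diff. As a rough sketch of what depth up-scaling by reusing pretrained weights can look like with transformers for a Llama-style checkpoint (the base checkpoint id, the duplicated layer range, and the output path below are illustrative assumptions, not taken from this repository):

```python
import copy

import torch
from transformers import AutoModelForCausalLM

# Illustrative placeholder -- the actual base checkpoint and layer choice
# used for Llama-3.1-10b-instruct are not documented in this diff.
BASE_ID = "meta-llama/Llama-3.1-8B-Instruct"

base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)

# Duplicate a span of existing decoder layers so the added depth starts from
# pretrained weights instead of random initialization.
layers = base.model.layers                      # nn.ModuleList of decoder layers
copies = [copy.deepcopy(layers[i]) for i in range(20, 28)]  # placeholder range

# Splice the copies back in and keep the config consistent with the new depth.
base.model.layers = torch.nn.ModuleList(list(layers[:28]) + copies + list(layers[28:]))
base.config.num_hidden_layers = len(base.model.layers)

# In recent transformers versions each attention module tracks its layer index
# for KV caching, so re-number the layers after splicing.
for idx, layer in enumerate(base.model.layers):
    layer.self_attn.layer_idx = idx

# This checkpoint is only a starting point for continued pretraining.
base.save_pretrained("llama-3.1-10b-init")
```

Continued pretraining would then resume from this widened checkpoint rather than from randomly initialized layers.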
@@ -34,14 +42,12 @@ base_model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
-
- tokenizer = AutoTokenizer.from_pretrained("empirischtech/Llama-3.1-10b-instruct")
+ model_id = "empirischtech/Llama-3.1-10b-instruct"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
- "upstage/llama-65b-instruct",
+ model_id,
device_map="auto",
- torch_dtype=torch.float16,
- load_in_8bit=True,
- rope_scaling={"type": "dynamic", "factor": 2} # allows handling of longer inputs
+ torch_dtype=torch.float16
)

prompt = "### User:\nThomas is healthy, but he has to go to the hospital. What could be the reasons?\n\n### Assistant:\n"
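# ---------------------------------------------------------------------------
# Editor's sketch (not part of the commit): the hunk ends at the prompt
# definition, so the generation step that would use the imported TextStreamer
# is not shown. Assuming the usual transformers generate() API, a minimal
# continuation could look like this; parameter values are illustrative only.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True)   # stream tokens as they are generated
output = model.generate(**inputs, streamer=streamer, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))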