Update README.md
README.md
CHANGED
@@ -25,6 +25,14 @@ base_model:
## Training


+Bigger models, more data, and better hardware have consistently improved deep learning performance. Whether in NLP or computer vision, larger models have led to major breakthroughs. However, most cutting-edge models are still trained from scratch, meaning they start from randomly initialized weights. The problem: training costs are skyrocketing.
+
+### Approach
+
+To address the escalating computational costs of training large-scale models, various approaches have been proposed. For instance, **[arXiv.2212.05055](https://doi.org/10.48550/arXiv.2212.05055)** demonstrates a method in which pretrained large models are upscaled by selectively retaining dense layers, followed by continued pretraining. This strategy can reduce the training budget by up to **50%** while maintaining performance.
+
+In this work, we take a step toward realizing such an approach. Specifically, we extend an existing **8B**-parameter model to **10B** parameters by initializing the additional layers with pretrained weights, followed by continued pretraining on a smaller dataset for multiple epochs. Due to budget constraints, we were unable to surpass the foundation model on the **EleutherAI** evaluation benchmark; however, our approach yielded improved **perplexity**, demonstrating the potential of cost-efficient scaling strategies for large language model development.
+

## Usage

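The card does not spell out the exact recipe behind the 8B-to-10B extension described under Training above. Below is a minimal sketch of one way such a depth extension could be initialized with PyTorch and transformers, assuming the extra capacity comes from duplicating existing pretrained decoder layers; the base-model id, the number of copied layers, the choice of which layers to copy, and the output path are illustrative assumptions, not documented settings.

```python
import copy

import torch
from torch import nn
from transformers import AutoModelForCausalLM

# Assumed starting point: the 8B instruct checkpoint this card builds on.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)

layers = base.model.layers   # nn.ModuleList of 32 decoder layers in the 8B model
num_extra = 8                # ~0.2B parameters per layer -> roughly 10B total (illustrative)

# Copy the last `num_extra` layers so every new layer starts from pretrained
# weights; which layers to duplicate is an assumption made for illustration.
extra = [copy.deepcopy(layers[i]) for i in range(len(layers) - num_extra, len(layers))]

base.model.layers = nn.ModuleList(list(layers) + extra)
base.config.num_hidden_layers = len(base.model.layers)

# Keep KV-cache bookkeeping consistent in recent transformers versions.
for idx, layer in enumerate(base.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = idx

# Save the deepened checkpoint as the starting point for continued pretraining.
base.save_pretrained("llama-3.1-10b-init")
```

Because every added layer starts from pretrained weights, continued pretraining only has to adapt the duplicated layers rather than learn them from scratch, which is what makes this style of scaling comparatively cheap.
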
@@ -34,14 +42,12 @@ base_model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
-
-tokenizer = AutoTokenizer.from_pretrained(
+model_id="empirischtech/Llama-3.1-10b-instruct"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
-
+    model_id,
    device_map="auto",
-    torch_dtype=torch.float16
-    load_in_8bit=True,
-    rope_scaling={"type": "dynamic", "factor": 2} # allows handling of longer inputs
+    torch_dtype=torch.float16
)

prompt = "### User:\nThomas is healthy, but he has to go to the hospital. What could be the reasons?\n\n### Assistant:\n"
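
The hunk ends at the prompt string, so the rest of the usage example is not shown in this diff. As a rough sketch (not the card's own continuation), this is a typical way such a prompt is run with the `TextStreamer` imported above; `max_new_tokens` and the sampling settings are assumed values.

```python
# Continues from the snippet in the diff above; model, tokenizer, prompt and
# TextStreamer are already defined there. Generation settings are assumptions.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

output = model.generate(
    **inputs,
    streamer=streamer,   # prints tokens to stdout as they are generated
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

# The full completion is also available as a string.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```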