---
library_name: transformers
tags:
- llm.c
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- teknium/OpenHermes-2.5
language:
- en
pipeline_tag: text-generation
---

# Model Card for llm.c GPT3_125M
## Instruction Pretraining: Fineweb-edu 10B interleaved with OpenHermes 2.5

![Loss](loss_curve.png)

Loss curves comparing training on Fineweb-Edu 10B only with training on Fineweb-Edu 10B interleaved with OpenHermes 2.5.

## Model Details


## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline
p = pipeline("text-generation", "jrahn/gpt3_125M_edu_hermes")

# instruction following
p("<|im_start|>user\nTeach me to fish.<|im_end|>\n<|im_start|>assistant\n", max_length=128)
# [{'generated_text': '<|im_start|>user\nTeach me to fish.<|im_end|>\n<|im_start|>assistant\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\nTeach me to fish.\n\n'}]

# text completion
p("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. ", max_length=128)
# [{'generated_text': 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. \nThe researchers were able to identify the unicorns by their unique language. The researchers found that the unicorns spoke a language that is similar to the language of the Andes Mountains.\nThe researchers also found that the unicorns spoke a language that is similar to the language of the Andes Mountains. This is the first time that the researchers have been able to identify the language of the Andes Mountains.'}]
```

## Training Details

### Training Data

Datasets used: Fineweb-Edu 10B + OpenHermes 2.5

Dataset proportions, in documents (FWE = Fineweb-Edu, OH = OpenHermes 2.5; the percentage is the OH share of each part):

- Part 1: FWE 4,836,050 + OH 100,000 (2.03%) = 4,936,050
- Part 2: FWE 4,336,051 + OH 400,000 (8.45%) = 4,736,051
- Part 3: FWE 500,000 + OH 501,551 (50.08%) = 1,001,551

Total documents: 10,669,024
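
The OpenHermes share of each part follows directly from the document counts listed above. The short sketch below recomputes those percentages; the dictionary layout and the helper name `openhermes_share` are illustrative choices, not taken from the training code.

```python
# Document counts per training part, copied from the list above.
parts = {
    "Part 1": {"fineweb_edu": 4_836_050, "openhermes": 100_000},
    "Part 2": {"fineweb_edu": 4_336_051, "openhermes": 400_000},
    "Part 3": {"fineweb_edu": 500_000, "openhermes": 501_551},
}


def openhermes_share(part):
    """Return the OpenHermes share of a part, in percent of its documents."""
    total = part["fineweb_edu"] + part["openhermes"]
    return 100.0 * part["openhermes"] / total


# Round to two decimals, matching the precision used in the list above.
shares = {name: round(openhermes_share(p), 2) for name, p in parts.items()}

for name, share in shares.items():
    print(f"{name}: {share}% OpenHermes")
```

The instruction data is concentrated toward the end of training: its share rises from about 2% in Part 1 to about 50% in Part 3.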

### Training Procedure

#### Preprocessing

- Fineweb-Edu: none, just the "text" feature  
- OpenHermes 2.5: applied ChatML prompt template to "conversations" to create the "text" feature
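
The exact preprocessing script is not part of this card, so the following is only a minimal sketch of how a "conversations" entry could be rendered into a ChatML "text" string. The turn layout is inferred from the prompt used in the getting-started example. The `from`/`value` keys, the `human`/`gpt`/`system` role names, and the helper name `to_chatml` are assumptions about the dataset schema, not confirmed by this card.

```python
# Assumed mapping from dataset speaker labels to ChatML role names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_chatml(conversations):
    """Render a list of {"from": ..., "value": ...} turns as one ChatML string."""
    turns = []
    for turn in conversations:
        role = ROLE_MAP[turn["from"]]
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        turns.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    return "".join(turns)


# Hypothetical two-turn conversation, for illustration only.
example = [
    {"from": "human", "value": "Teach me to fish."},
    {"from": "gpt", "value": "Start with a simple rod and reel."},
]
text = to_chatml(example)
print(text)
```

At inference time the same layout applies, except that the prompt ends after `<|im_start|>assistant\n` so the model generates the assistant turn.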



#### Training Hyperparameters

**Training regime:**

- bf16 precision
- context length 2048
- per-device batch size 16, global batch size 524,288 tokens, gradient accumulation 16
- ZeRO stage 1
- learning rate 6e-4, cosine schedule, 700 warmup steps
- for more details, see the [run script](run_gpt3_150M_edu_hermes.sh)
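
The relationship between the batch-size settings above can be checked with a few lines of arithmetic. This sketch assumes the global batch size is counted in tokens and that the two GPUs split the micro-batches evenly in a data-parallel setup; the variable names are illustrative, and the per-GPU split is an assumption rather than something stated in the run script.

```python
per_device_batch_size = 16      # sequences per micro-batch on one GPU
context_length = 2048           # tokens per sequence
global_batch_tokens = 524_288   # tokens per optimizer step
num_gpus = 2                    # from the hardware section

# Tokens processed by one micro-batch on one GPU.
tokens_per_micro_batch = per_device_batch_size * context_length

# The global batch must be a whole number of micro-batches.
assert global_batch_tokens % tokens_per_micro_batch == 0

# Micro-batches needed to fill one global batch.
micro_batches_per_step = global_batch_tokens // tokens_per_micro_batch

# Assumption: the micro-batches are divided evenly across the GPUs.
micro_batches_per_gpu = micro_batches_per_step // num_gpus

print(micro_batches_per_step, micro_batches_per_gpu)
```

Under these assumptions one optimizer step consists of 16 micro-batches in total, which matches the gradient accumulation value listed above.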

#### Speeds, Sizes, Times

Params: 125M -> 250MB / checkpoint  
Tokens: ~10B (10,287,579,136)  
Total training time: ~12hrs  
Hardware: 2x RTX4090  
MFU: 70% (266,000 tok/s)  
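
As a rough consistency check, the figures above can be related to each other with simple arithmetic. The sketch assumes a nominal 125M parameters stored at 2 bytes each (bf16) and a constant throughput of 266,000 tok/s; the real parameter count and sustained throughput may differ, and the variable names are illustrative.

```python
total_tokens = 10_287_579_136   # tokens seen during training
global_batch_tokens = 524_288   # tokens per optimizer step
throughput_tok_s = 266_000      # reported tokens per second
n_params = 125_000_000          # nominal parameter count (assumption)
bytes_per_param = 2             # bf16 weights

# Number of optimizer steps implied by the token count and batch size.
optimizer_steps = total_tokens // global_batch_tokens

# Approximate checkpoint size in megabytes (weights only).
checkpoint_mb = n_params * bytes_per_param / 1e6

# Pure compute time at constant throughput, ignoring evaluation,
# checkpointing, and startup overhead.
compute_hours = total_tokens / throughput_tok_s / 3600

print(optimizer_steps, checkpoint_mb, round(compute_hours, 1))
```

The token count is an exact multiple of the global batch size, and the throughput-only estimate lands somewhat below the reported ~12 hours, which would be expected if the total includes overhead beyond the training steps themselves.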


## Evaluation

### Results

HellaSwag: 30.5

For more details, see [main.log](main.log).

## Technical Specifications

### Model Architecture and Objective

GPT-3 125M, causal language modeling

### Compute Infrastructure

#### Hardware

2x RTX4090

#### Software

[llm.c](https://github.com/karpathy/llm.c)