---
datasets:
- EleutherAI/the_pile_deduplicated
language:
- en
---

# GPT2A-Pile-Test-285M


### Use

Requires: `transformers`, `einops`

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "crumb/gpt2a-pile-test-285m"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # the repo ships custom modeling code
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

input_ids = tokenizer("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.", return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128, temperature=0.7, do_sample=True, top_p=0.95, repetition_penalty=1.1)
print(tokenizer.batch_decode(outputs)[0])
"""
<s> In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. It turns out that only a few people believed their father’s name had been made public by their mother and her father.

As the study finds, scientists discovered the secretive organic matter in the Andes mountains: “In a forest surrounded by a stream of lakes, an unseen rock formed with a series of tuberous orbits.”

They found that the mysterious bodies were buried in some parts of the mountain, known as the Andes mountain. The researchers then searched for the body, which they identified with the Butterfly.

The discovery of the body is the result of
"""
```
(it's... a little undertrained! that's okay!)

### Parameter count

| parameter calculation | parameter count |
| --- | --- |
| model | 809,579,521 |
| model - model.transformer.wte | 539,045,889 |
| model - model.transformer.wte[0] *(llama2-70b embeddings without projection)* | 547,435,521 |
| model - model.transformer.wte - model.lm_head | 268,505,089 |
| model - model.transformer.wte[0] - model.lm_head[1] *(minus all params taken from llama2-70b)* | 285,291,521 |
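
The breakdown above can be reproduced by counting parameters per module. A minimal sketch, assuming the module names from the table (`model.transformer.wte`, `model.lm_head`) and the same loading call as in the usage example:

```python
from transformers import AutoModelForCausalLM

# Same model id as above; trust_remote_code is needed for the custom architecture.
model = AutoModelForCausalLM.from_pretrained(
    "crumb/gpt2a-pile-test-285m", trust_remote_code=True
)

def n_params(module):
    # Total number of elements across all parameter tensors in a module.
    return sum(p.numel() for p in module.parameters())

total = n_params(model)
wte = n_params(model.transformer.wte)
head = n_params(model.lm_head)

print(f"model:                 {total:,}")
print(f"model - wte:           {total - wte:,}")
print(f"model - wte - lm_head: {total - wte - head:,}")
```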

### Details

This model uses a custom architecture built for fast and efficient pretraining. Despite being grossly undertrained (~1 billion tokens from the Pile), it achieves an estimated Pile test loss of 2.484 and an estimated Pile BPB of 1.313 (compare to GPT-2-1.5B at 1.2253). Training took only 12 hours on a 2xA6000 machine, and longer runs are expected to follow. To aid efficiency, the token embedding and language modelling head from llama-2-70b are taken and adapted with linear projections into the model's embedding space (from 8,192 down to 1,024 dimensions). I expect leveraging pretrained embeddings in this way to become more prevalent as more people become familiar with the modules inside transformer models, though it has long been a practice within some circles of hobbyists. The embedding and language modelling head are also commonly the two trainable modules with the most parameters (e.g. the embedding weight of llama-2-70b is a 32,000x8,192 tensor), so freezing them brings further speed and memory gains.
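
To illustrate the kind of adapter described above (a sketch, not the model's actual implementation), a frozen donor embedding can be wrapped with a trainable linear down-projection from the donor width (8,192) to the model width (1,024). The class and the stand-in weights below are hypothetical:

```python
import torch
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Frozen donor embedding followed by a trainable linear down-projection."""

    def __init__(self, donor_weight: torch.Tensor, model_dim: int):
        super().__init__()
        # The donor embedding stays frozen (for llama-2-70b it is 32,000 x 8,192).
        self.embed = nn.Embedding.from_pretrained(donor_weight, freeze=True)
        # Only this projection (donor_dim -> model_dim) is trained.
        self.proj = nn.Linear(donor_weight.shape[1], model_dim, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(input_ids))

# Tiny stand-in weights instead of the real 32,000 x 8,192 llama-2-70b embedding.
donor = torch.randn(1000, 64)
wte = ProjectedEmbedding(donor, model_dim=16)
tokens = torch.randint(0, 1000, (1, 8))
print(wte(tokens).shape)  # torch.Size([1, 8, 16])
```

The language modelling head would presumably be handled symmetrically: a trainable projection from the model width back up to the donor width, followed by the frozen llama-2-70b output head (matching the `model.lm_head[1]` entry in the table above).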

This model is a proof of concept showing that the methods behind it can work; it is meant for research, not for production environments. It may also produce harmful, offensive, and untrue content, reflecting the biases of its training data.