crumb committed on
Commit cc731f1 · 1 Parent(s): ae60778

Update README.md

Files changed (1)
  1. README.md +24 -1
README.md CHANGED
@@ -6,4 +6,27 @@ language:
  ---
  This model uses a custom architecture built for fast and efficient pretraining. Despite being grossly undertrained (~2 billion tokens from the Pile), it achieves an estimated Pile test loss of 2.484 and an estimated Pile BPB of 1.313 (compare to GPT-2-1.5B at 1.2253). Training took only 12 hours on a 2xA6000 machine, and longer runs are expected to follow. To aid efficiency, the token embedding and language modelling head are taken from llama-2-70b and adapted with linear projections into the model's embedding space (from 8,192 down to 1,024 dimensions); a rough sketch of this wiring appears after the diff. I expect leveraging pretrained embeddings in this way to become more prevalent as more people learn about the modules in transformers, though it has long been a practice in some hobbyist circles. The embedding and language modelling head are also commonly the two trainable modules with the most parameters (e.g. the embedding weight of llama-2-70b is a 32,000x8,192 tensor), so freezing them brings further speed and memory gains.

- This model is a proof of concept to show that the methods behind it can work; it is intended for research, not for production environments. It may also produce harmful, offensive, and untrue content, reflecting the biases of its training data.
+ This model is a proof of concept to show that the methods behind it can work; it is intended for research, not for production environments. It may also produce harmful, offensive, and untrue content, reflecting the biases of its training data.
+
+ ### Use
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "crumb/pile-test-model"
+ # trust_remote_code=True is needed because this repo ships a custom architecture.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Encode a prompt and sample a short continuation.
+ input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
+ outputs = model.generate(input_ids, max_new_tokens=64, temperature=0.8, do_sample=True)
+ print(tokenizer.batch_decode(outputs)[0])
+ """
+ <s> Once upon a time after a high-profile lawsuit, the FBI will move further into the investigation, according to Reuters
+
+ The FBI will have the ability to identify and prosecute a person who, according to Reuters, is a former FBI agent who was arrested and charged with the crime of plot
+ """
+ ```
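
As a reading aid, here is a minimal sketch of the frozen-embedding setup the README paragraph above describes: llama-2-70b's token embedding and language modelling head are kept frozen, and small trainable linear projections map between their 8,192-dimensional space and this model's 1,024-dimensional space. This is not the repository's actual code; the `backbone` argument and all class and variable names here are hypothetical placeholders, since the custom architecture itself is not shown in this commit.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 32_000  # llama-2-70b vocabulary size (from the README)
DONOR_DIM = 8_192    # llama-2-70b embedding width
MODEL_DIM = 1_024    # this model's embedding width

class ProjectedEmbeddingLM(nn.Module):
    """Hypothetical wrapper: frozen donor embedding/head plus trainable projections."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # In practice these would be initialized from llama-2-70b's weights and frozen.
        self.embed = nn.Embedding(VOCAB_SIZE, DONOR_DIM)
        self.lm_head = nn.Linear(DONOR_DIM, VOCAB_SIZE, bias=False)
        for p in list(self.embed.parameters()) + list(self.lm_head.parameters()):
            p.requires_grad = False

        # Trainable adapters into and out of the smaller model dimension.
        self.proj_in = nn.Linear(DONOR_DIM, MODEL_DIM)
        self.proj_out = nn.Linear(MODEL_DIM, DONOR_DIM)
        self.backbone = backbone  # placeholder for the custom 1,024-dim trunk

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        hidden = self.proj_in(self.embed(input_ids))  # (batch, seq, 1024)
        hidden = self.backbone(hidden)                # (batch, seq, 1024)
        return self.lm_head(self.proj_out(hidden))    # (batch, seq, 32000) logits


# Smoke test with an identity backbone; a real transformer trunk would go here instead.
model = ProjectedEmbeddingLM(backbone=nn.Identity())
logits = model(torch.randint(0, VOCAB_SIZE, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```

Under this setup only the two projections and the backbone receive gradients, which is where the speed and memory gains described in the README come from.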