crumb committed on
Commit cc731f1 · 1 Parent(s): ae60778

Update README.md

Files changed (1)
  1. README.md +24 -1
README.md CHANGED
@@ -6,4 +6,27 @@ language:
  ---
  This model uses a custom architecture built for fast and efficient pretraining. Despite being grossly undertrained (~2 billion tokens from the Pile), it achieves an estimated Pile test loss of 2.484 and an estimated Pile BPB of 1.313 (compare to GPT-2-1.5B at 1.2253). Training took only 12 hours on a 2xA6000 machine, and longer runs are expected to follow. To aid efficiency, the token embedding and language modelling head are taken from llama-2-70b and adapted with linear projections into the model's embedding space (from 8,192 down to 1,024 dimensions); a rough sketch of this wiring appears after the diff. I expect leveraging pretrained embeddings in this way to become more prevalent as more people learn about the modules in transformers, though it has long been a practice in some hobbyist circles. The embedding and language modelling head are also commonly the two trainable modules with the most parameters (e.g. the embedding weight of llama-2-70b is a 32,000x8,192 tensor), so freezing them brings further speed and memory gains.

- This model is a proof of concept to show that the methods behind it can work; it is intended for research, not for production environments. It may also produce harmful, offensive, and untrue content, reflecting the biases of its training data.
+ This model is a proof of concept to show that the methods behind it can work; it is intended for research, not for production environments. It may also produce harmful, offensive, and untrue content, reflecting the biases of its training data.
+
+ ### Use
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "crumb/pile-test-model"
+ # trust_remote_code=True is needed because this repo ships a custom architecture.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Encode a prompt and sample a short continuation.
+ input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
+ outputs = model.generate(input_ids, max_new_tokens=64, temperature=0.8, do_sample=True)
+ print(tokenizer.batch_decode(outputs)[0])
+ """
+ <s> Once upon a time after a high-profile lawsuit, the FBI will move further into the investigation, according to Reuters
+
+ The FBI will have the ability to identify and prosecute a person who, according to Reuters, is a former FBI agent who was arrested and charged with the crime of plot
+ """
+ ```
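
As a reading aid, here is a minimal sketch of the frozen-embedding setup the README paragraph above describes: llama-2-70b's token embedding and language modelling head are kept frozen, and small trainable linear projections map between their 8,192-dimensional space and this model's 1,024-dimensional space. This is not the repository's actual code; the `backbone` argument and all class and variable names here are hypothetical placeholders, since the custom architecture itself is not shown in this commit.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 32_000  # llama-2-70b vocabulary size (from the README)
DONOR_DIM = 8_192    # llama-2-70b embedding width
MODEL_DIM = 1_024    # this model's embedding width

class ProjectedEmbeddingLM(nn.Module):
    """Hypothetical wrapper: frozen donor embedding/head plus trainable projections."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # In practice these would be initialized from llama-2-70b's weights and frozen.
        self.embed = nn.Embedding(VOCAB_SIZE, DONOR_DIM)
        self.lm_head = nn.Linear(DONOR_DIM, VOCAB_SIZE, bias=False)
        for p in list(self.embed.parameters()) + list(self.lm_head.parameters()):
            p.requires_grad = False

        # Trainable adapters into and out of the smaller model dimension.
        self.proj_in = nn.Linear(DONOR_DIM, MODEL_DIM)
        self.proj_out = nn.Linear(MODEL_DIM, DONOR_DIM)
        self.backbone = backbone  # placeholder for the custom 1,024-dim trunk

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        hidden = self.proj_in(self.embed(input_ids))  # (batch, seq, 1024)
        hidden = self.backbone(hidden)                # (batch, seq, 1024)
        return self.lm_head(self.proj_out(hidden))    # (batch, seq, 32000) logits


# Smoke test with an identity backbone; a real transformer trunk would go here instead.
model = ProjectedEmbeddingLM(backbone=nn.Identity())
logits = model(torch.randint(0, VOCAB_SIZE, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```

Under this setup only the two projections and the backbone receive gradients, which is where the speed and memory gains described in the README come from.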