---
language: en
license: mit
tags:
- gpt2
- causal-lm
- from-scratch
- tinystories
datasets:
- roneneldan/TinyStories
library_name: transformers
pipeline_tag: text-generation
---
# GPT-2-Style TinyStories Model (From Scratch)
## Overview
This repository contains a GPT-2–style language model trained from scratch on the [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.
The goal was to build a small, educational, and easily reproducible transformer language model for story generation.
**This model is designed for:**
- Researchers exploring end-to-end LLM training workflows.
- Beginners who want a hands-on example of training a transformer from scratch.
- Educators demonstrating modern NLP model development without huge compute budgets.
---
## Hardware & Environment
- **Platform**: Google Colab Pro+
- **GPU**: NVIDIA A100 (40 GB VRAM)
- **CPU RAM**: 83.5 GB
- **Disk**: 235.7 GB
- **Python**: 3.x (Colab default)
- **Frameworks**:
- `transformers` (latest from pip)
- `datasets`
- `accelerate`
- `huggingface_hub`
---
## Dataset
**Dataset**: `roneneldan/TinyStories` — a curated synthetic dataset of short children’s stories.
- **Language**: English
- **Cleanliness**: High — minimal preprocessing needed
- **Structure**: Each sample contains a single text field with a complete story
**Why this dataset?**
- High signal-to-noise ratio.
- Ideal for small models — vocabulary is modest, sentence structures are simple.
- Useful for quick iterations and visible training convergence.
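
For reference, the dataset can be loaded and inspected in a couple of lines (the variable name `raw` is the one reused in the preprocessing snippet further down):

```python
from datasets import load_dataset

# Each example is a dict with a single "text" field containing one complete story
raw = load_dataset("roneneldan/TinyStories")
print(raw)                            # DatasetDict with train/validation splits
print(raw["train"][0]["text"][:200])  # peek at the first story
```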
---
## Model Architecture
A small GPT-2–like causal language model:
| Hyperparameter | Value |
|-----------------|---------|
| Layers (n_layer) | 8 |
| Attention Heads (n_head) | 8 |
| Embedding Dim (n_embd) | 256 |
| Vocabulary Size | 16,384 |
| Sequence Length (block_size) | 512 |
| Params (approx.) | ~10–12M |
| Rotary Positional Embeddings | Disabled |
| Dropout | 0.0 |
| Loss Function | ForCausalLMLoss (auto-selected) |
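
A sketch of how a configuration like this can be instantiated with `transformers` (reconstructed from the table above, not copied from the original training script; exact field values may differ slightly):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative config matching the table above
config = GPT2Config(
    vocab_size=16_384,
    n_positions=512,   # block_size
    n_embd=256,
    n_layer=8,
    n_head=8,
    resid_pdrop=0.0,   # dropout disabled
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # lands in the ~10–12M range
```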
---
## Training Setup
```python
TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    weight_decay=0.1,
    warmup_ratio=0.03,
    logging_steps=50,
    save_steps=500,
    save_total_limit=3,
    bf16=True,   # mixed precision (supported on A100)
    fp16=False,
    evaluation_strategy="steps",
    eval_steps=500,
)
```
- **Optimizer**: AdamW (default in HF Trainer)
- **Data Loading**: `datasets` streaming & tokenization with `block_size=512`
- **Collator**: `DataCollatorForLanguageModeling` with `mlm=False`
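
Putting the pieces together looks roughly like this (the variable names `model`, `tokenizer`, `training_args`, and `lm_datasets` are assumed from the snippets in this card, not copied from the original notebook):

```python
from transformers import Trainer, DataCollatorForLanguageModeling

# Standard causal-LM collator: no masking, labels come from the dataset
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)
trainer.train()
```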
---
## Tokenization & Preprocessing
```python
from itertools import chain

def tokenize_fn(batch):
    # Tokenize the raw stories; blocks are packed below, so no special tokens here
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)

def group_texts(examples):
    # Concatenate every tokenized column, then split into fixed-size blocks.
    # Handling all columns (input_ids, attention_mask) keeps batch lengths
    # consistent, since this map changes the number of rows.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // CFG.block_size) * CFG.block_size
    result = {
        k: [t[i:i + CFG.block_size] for i in range(0, total_length, CFG.block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are the inputs; the model shifts them internally
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
```
---
## Tokens
- **Number of sequences in train set**: 899,394
- **Tokens per step**: 65,536
- **Steps per epoch**: 7,026
- **Total steps**: 21,078
- **Total tokens processed**: 1,381,367,808
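
These figures follow directly from the batch size and sequence length; a quick sanity check:

```python
# Back-of-the-envelope check of the numbers above
batch_size, block_size, epochs = 128, 512, 3
train_sequences = 899_394

tokens_per_step = batch_size * block_size        # 65,536
steps_per_epoch = train_sequences // batch_size  # 7,026 (floor; rounding the last
                                                 #  partial batch up gives the
                                                 #  21,081 steps logged below)
total_steps = steps_per_epoch * epochs           # 21,078
total_tokens = total_steps * tokens_per_step     # 1,381,367,808
```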
---
## Training Run & Metrics
- **Total steps**: 21,081
- **Total FLOPs**: 5.24 × 10^16
- **Runtime**: ~1h 44m on A100 (Colab)
- **Final Train Loss**: 1.8054 (the Trainer’s reported `train_loss`, averaged over the whole run, hence higher than the last logged step below)
Loss curve snapshot (selected steps):
| Step | Loss |
|-------|--------|
| 50 | 9.2160 |
| 100 | 8.2987 |
| 500 | 3.6695 |
| 1000 | 2.6862 |
| 5000 | 1.7699 |
| 10000 | 1.6385 |
| 15000 | 1.5620 |
| 21000 | 1.5140 |
**Interpretation**:
The rapid drop in loss over the first few hundred steps shows the model quickly picking up the basic statistics of the text.
A final logged loss of ≈ 1.51 suggests the model has learned coherent sentence structure and vocabulary use for TinyStories-style text.
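
For intuition, cross-entropy loss converts to perplexity via `exp(loss)`, so a loss of ~1.51 corresponds to a perplexity of roughly 4.5, i.e. the model is effectively choosing among about 4–5 plausible next tokens:

```python
import math

print(math.exp(1.514))  # ≈ 4.54 — approximate perplexity at the final logged loss
```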
---
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Use float16 on GPU; fall back to float32 on CPU, where half precision is poorly supported
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=dtype)
if torch.cuda.is_available():
    model.to("cuda")

prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Lessons & Recommendations for Newcomers
- **Start Small** — Begin with a small dataset and small model. You’ll see results quickly without burning GPU time.
- **Mixed Precision (bf16/fp16)** — Saves VRAM and speeds up training.
- **Clean Data** — High-quality datasets like TinyStories make it easier to reach good results.
- **Checkpoints** — Save regularly (`save_steps`) in case Colab disconnects.
- **Colab Session Stability** — Keep your browser awake, use a stable internet connection.
- **Publishing Early** — Push checkpoints to Hugging Face as you go to avoid accidental data loss (see the sketch below).
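
A minimal upload sketch, assuming you are logged in to the Hub and `repo_id` points at your own repository (the name below is a placeholder):

```python
from huggingface_hub import login

login()  # or set the HF_TOKEN secret in Colab

repo_id = "your-username/gpt2-tinystories-from-scratch"  # placeholder
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```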
---
## Limitations
- Short context length (512 tokens).
- Limited generalization beyond TinyStories style/content.
- Not suitable for factual QA or large-context reasoning.