# GPT-2-Style TinyStories Model (From Scratch)

## Overview
This repository contains a GPT-2–style language model trained from scratch on the [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset, using Hugging Face’s Transformers library on a Google Colab Pro+ A100 GPU.
The goal was to build a small, educational, and easily reproducible transformer language model for story generation.

**This model is designed for:**
- Researchers exploring end-to-end LLM training workflows.
- Beginners who want a hands-on example of training a transformer from scratch.
- Educators demonstrating modern NLP model development without huge compute budgets.

---
## Hardware & Environment
- **Platform**: Google Colab Pro+
- **GPU**: NVIDIA A100 (40 GB VRAM)
- **CPU RAM**: 83.5 GB
- **Disk**: 235.7 GB
- **Python**: 3.x (Colab default)
- **Frameworks** (see the version check after this list):
  - `transformers` (latest from pip)
  - `datasets`
  - `accelerate`
  - `huggingface_hub`

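Exact package versions are not pinned in the card, so the environment is whatever `pip install transformers datasets accelerate huggingface_hub` resolves on the day you run it (plus the `torch` build that Colab ships). A quick, purely illustrative version check after installation:

```python
# Print the library versions actually in use; the numbers you see depend on
# when the environment was built, since nothing is pinned in this card.
import accelerate
import datasets
import huggingface_hub
import torch
import transformers

for name, module in [
    ("transformers", transformers),
    ("datasets", datasets),
    ("accelerate", accelerate),
    ("huggingface_hub", huggingface_hub),
    ("torch", torch),
]:
    print(f"{name:>15}: {module.__version__}")

print("CUDA available:", torch.cuda.is_available())
```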

---
## Dataset
**Dataset**: `roneneldan/TinyStories` — a curated synthetic dataset of short children’s stories.
- **Language**: English
- **Cleanliness**: High — minimal preprocessing needed
- **Structure**: Each sample contains a single `text` field with a complete story (see the snippet after this list)

**Why this dataset?**
- High signal-to-noise ratio.
- Ideal for small models — the vocabulary is modest and the sentence structures are simple.
- Useful for quick iteration and visible training convergence.

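The loading code itself is not shown in the card; a minimal sketch using the public `datasets` API (with the split names as published on the Hub) looks like this:

```python
from datasets import load_dataset

# TinyStories is published with train and validation splits.
raw = load_dataset("roneneldan/TinyStories")

print(raw)                            # split names and row counts
print(raw["train"][0]["text"][:200])  # first 200 characters of one story
```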

---
## Model Architecture
A small GPT-2–like causal language model:

| Hyperparameter | Value |
|----------------|-------|
| Layers (`n_layer`) | 8 |
| Attention Heads (`n_head`) | 8 |
| Embedding Dim (`n_embd`) | 256 |
| Vocabulary Size | 16,384 |
| Sequence Length (`block_size`) | 512 |
| Params (approx.) | ~10–12M |
| Rotary Positional Embeddings | Disabled |
| Dropout | 0.0 |
| Loss Function | `ForCausalLMLoss` (auto-selected) |

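The model-construction code is not included in the card; a minimal sketch that reproduces the table above with the stock `GPT2Config`/`GPT2LMHeadModel` classes (the exact config class used for this checkpoint may differ) would be:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Hyperparameters from the table above; everything else stays at GPT-2 defaults
# (learned positional embeddings, tied input/output embeddings).
config = GPT2Config(
    vocab_size=16_384,
    n_positions=512,   # block_size
    n_embd=256,
    n_layer=8,
    n_head=8,
    resid_pdrop=0.0,
    embd_pdrop=0.0,
    attn_pdrop=0.0,
)

model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # lands in the ~10-12M range
```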

---
## Training Setup
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "gpt2-tinystories-from-scratch",  # placeholder; the original output path is not shown in the card
    num_train_epochs = 3,
    per_device_train_batch_size = 128,
    per_device_eval_batch_size = 128,
    gradient_accumulation_steps = 1,
    learning_rate = 3e-4,
    weight_decay = 0.1,
    warmup_ratio = 0.03,
    logging_steps = 50,
    save_steps = 500,
    save_total_limit = 3,
    bf16 = True,   # mixed precision (bfloat16 on the A100)
    fp16 = False,
    evaluation_strategy = "steps",  # renamed to eval_strategy in recent transformers releases
    eval_steps = 500,
)
```

- **Optimizer**: AdamW (default in HF `Trainer`)
- **Data Loading**: `datasets` streaming & tokenization with `block_size=512`
- **Collator**: `DataCollatorForLanguageModeling` with `mlm=False` (see the sketch below)

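The card shows the training arguments but not the `Trainer` wiring itself; a minimal sketch, reusing the `training_args`, `model`, `tokenizer`, and `lm_datasets` objects from the other snippets on this page, is:

```python
from transformers import DataCollatorForLanguageModeling, Trainer

# mlm=False produces causal-LM batches: labels are a copy of input_ids and the
# model shifts them internally when computing the loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

trainer.train()
```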

---
## Tokenization & Preprocessing
```python
from itertools import chain

# `raw` is the TinyStories DatasetDict (see the Dataset section), `tokenizer` is
# the 16,384-token tokenizer used for this model, and CFG.block_size is 512.

def tokenize_fn(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)

def group_texts(examples):
    # Concatenate every tokenized story in the batch, then cut the stream into
    # fixed-size blocks of CFG.block_size tokens.
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // CFG.block_size) * CFG.block_size
    concatenated = concatenated[:total_length]
    result = {
        "input_ids": [concatenated[i:i + CFG.block_size] for i in range(0, total_length, CFG.block_size)]
    }
    result["labels"] = result["input_ids"].copy()
    return result

# The number of rows changes here, so the pre-grouping columns must be dropped.
lm_datasets = tokenized.map(group_texts, batched=True, remove_columns=tokenized["train"].column_names)
```

---
## Training Run & Metrics
- **Total steps**: 21,081
- **Total FLOPs**: 5.24 × 10^16
- **Runtime**: ~1h 44m on an A100 (Colab)
- **Final Train Loss**: 1.8054 (the last individual logged step loss was 1.5140, see below)

Loss curve snapshot (selected steps):

| Step | Loss |
|------|------|
| 50 | 9.2160 |
| 100 | 8.2987 |
| 500 | 3.6695 |
| 1000 | 2.6862 |
| 5000 | 1.7699 |
| 10000 | 1.6385 |
| 15000 | 1.5620 |
| 21000 | 1.5140 |


**Interpretation**:
The rapid drop in loss during the early steps indicates effective learning.
A final step loss of ≈ 1.51 suggests the model has learned coherent structure and vocabulary use for TinyStories-style text.

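For intuition, a causal-LM cross-entropy loss (in nats) converts to perplexity via `exp(loss)`:

```python
import math

# Perplexity implied by the last logged step loss (step 21000 in the table above).
print(math.exp(1.5140))  # ≈ 4.54, i.e. roughly 4-5 equally likely next tokens on average
```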

---
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float16)
if torch.cuda.is_available():
    model.to("cuda")

prompt = "One day, a little girl named Lily found a needle in her"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
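On a CPU-only machine the same snippet works, but half precision brings little benefit there; a simple fallback is to load in full precision:

```python
# CPU-only fallback (assumption: no CUDA device available); full precision
# avoids slow or unsupported half-precision kernels on CPU.
model = AutoModelForCausalLM.from_pretrained(repo_id)
```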

---
## Lessons & Recommendations for Newcomers
- **Start Small** — Begin with a small dataset and a small model. You’ll see results quickly without burning GPU time.
- **Mixed Precision (bf16/fp16)** — Saves VRAM and speeds up training.
- **Clean Data** — High-quality datasets like TinyStories make it easier to reach good results.
- **Checkpoints** — Save regularly (`save_steps`) in case Colab disconnects.
- **Colab Session Stability** — Keep your browser awake and use a stable internet connection.
- **Publishing Early** — Push checkpoints to Hugging Face to avoid accidental data loss (see the sketch after this list).

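The upload step itself is not shown in the card; a minimal sketch, assuming the trained `model` and `tokenizer` objects are still in memory and you have a write-enabled Hugging Face token, is:

```python
from huggingface_hub import login

# Authenticate with a write-enabled token (prompted interactively in Colab).
login()

repo_id = "vijaymohan/gpt2-tinystories-from-scratch-10m"

# Upload the trained weights and the tokenizer to the Hub.
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```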

---
## Limitations
- Short context length (512 tokens).
- Limited generalization beyond the TinyStories style and content.
- Not suitable for factual QA or long-context reasoning.