Custom GPT Language Model
A custom GPT-style language model with roughly 100M parameters, trained on the HuggingFaceFW/fineweb dataset. It is designed to train efficiently on consumer GPUs.
Model Architecture
- Architecture Type: GPT-style Transformer
- Model Size: ~100M parameters
- Context Length: 512 tokens
- Embedding Dimension: 640
- Attention Heads: 10
- Layers: 12
- Vocabulary Size: 50,000 tokens
- Training Precision: Mixed FP16
Parameter Count Breakdown
- Token Embeddings: 32M parameters (50,000 × 640)
- Position Embeddings: 0.3M parameters (512 × 640)
- Transformer Blocks: 67.7M parameters (12 layers × [attention + feed-forward])
  - Each block: ~5.6M parameters
    - Self-attention: 1.6M parameters per block
    - Feed-forward: 4M parameters per block
    - Layer Normalization: ~0.003M parameters
- Total: ~100M parameters
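The breakdown above can be cross-checked with a few lines of arithmetic. The sketch below assumes a standard GPT-2-style block with tied input/output embeddings and ignores biases and the final LayerNorm. The feed-forward width is not stated in this README, so `ffn_mult` is a hypothetical parameter. With the common 4× expansion the feed-forward part comes to about 3.3M per block and the total to about 91M, somewhat below the figures listed, which suggests the actual model uses a wider feed-forward layer or counts parameters differently.

```python
# Back-of-the-envelope parameter count for a GPT-2-style decoder.
# Sizes come from the Model Architecture list; ffn_mult is an assumption.
VOCAB_SIZE = 50_000
N_POSITIONS = 512
N_EMBD = 640
N_LAYER = 12


def count_params(ffn_mult: int = 4) -> dict:
    """Return weight counts per component (biases and final LayerNorm ignored)."""
    d = N_EMBD
    token_emb = VOCAB_SIZE * d        # 50,000 x 640
    pos_emb = N_POSITIONS * d         # 512 x 640
    attn = 4 * d * d                  # Q, K, V and output projections
    ffn = 2 * d * (ffn_mult * d)      # up-projection and down-projection
    layer_norm = 2 * 2 * d            # two LayerNorms, each with scale and shift
    block = attn + ffn + layer_norm
    total = token_emb + pos_emb + N_LAYER * block
    return {
        "token_emb": token_emb,
        "pos_emb": pos_emb,
        "attn": attn,
        "ffn": ffn,
        "layer_norm": layer_norm,
        "block": block,
        "total": total,
    }


counts = count_params()
for name, value in counts.items():
    print(f"{name:>10}: {value / 1e6:.3f}M")
```

The embedding and self-attention terms reproduce the 32M, 0.3M and 1.6M figures exactly; only the feed-forward term depends on the assumed width.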
Features
- ByteLevelBPE tokenizer with special tokens support
- DeepSpeed ZeRO Stage-2 optimization
- Gradient checkpointing option for memory efficiency
- Streaming dataset support for handling large datasets
- Wandb integration for experiment tracking
- FP16 mixed precision training
- Efficient data loading with dynamic batching
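To illustrate the streaming idea, here is a minimal sketch of consuming a text stream until a size budget is reached, without holding the full dataset in memory. The helper name `take_until_bytes` and the toy generator are hypothetical. In a real pipeline the stream would presumably come from the Hugging Face `datasets` library in streaming mode, and the budget would correspond to the 2.5GB target listed under Training.

```python
from typing import Iterable, Iterator


def take_until_bytes(texts: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Yield documents from a (possibly endless) stream until the
    cumulative UTF-8 size would exceed max_bytes."""
    seen = 0
    for text in texts:
        size = len(text.encode("utf-8"))
        if seen + size > max_bytes:
            break
        seen += size
        yield text


# Toy stand-in for a streamed dataset: a lazy generator of short documents.
toy_stream = (f"document {i}" for i in range(1_000_000))
subset = list(take_until_bytes(toy_stream, max_bytes=100))
print(len(subset), "documents kept")
```

Because both the source and the helper are lazy, only the documents that fit the budget are ever materialized.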
Training
The model is trained using:
- HuggingFaceFW/fineweb dataset
- AdamW optimizer with weight decay
- Learning rate: 1e-4 with warmup
- Gradient clipping at 1.0
- Effective batch size: 64 (8 per GPU × 8 gradient accumulation steps)
- Training epochs: 3
- Target dataset size: 2.5GB
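Two of the settings above, learning-rate warmup and gradient clipping, can be sketched in plain Python. The README does not state the warmup length or what happens after warmup, so `warmup_steps=1000` and the constant rate afterwards are assumptions, and both function names are hypothetical. In the actual training script these steps are most likely handled by the optimizer and DeepSpeed rather than by hand.

```python
import math
from typing import List


def lr_at_step(step: int, base_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup to base_lr over warmup_steps, then constant (assumed)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def clip_by_global_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale gradients so their combined L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


for step in (0, 499, 999, 5000):
    print(step, lr_at_step(step))
print(clip_by_global_norm([3.0, 4.0]))
```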
Requirements
pip install torch transformers accelerate deepspeed wandb tqdm
Usage
Training
python src/train.py
Inference
from src.inference import generate_text

prompt = "Once upon a time"
generated_text = generate_text(
    prompt=prompt,
    max_new_tokens=50,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)
print(generated_text)
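The internals of `generate_text` are not shown here. As a rough guide to what `temperature`, `top_k` and `top_p` conventionally do, the following self-contained sketch filters a list of logits and samples one token id. The function names `filtered_probs` and `sample_token` are hypothetical and are not part of `src.inference`.

```python
import math
import random
from typing import Dict, List


def filtered_probs(logits: List[float], temperature: float = 0.7,
                   top_k: int = 50, top_p: float = 0.95) -> Dict[int, float]:
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(order, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


def sample_token(logits: List[float], rng: random.Random, **kwargs) -> int:
    """Draw one token id from the filtered distribution."""
    probs = filtered_probs(logits, **kwargs)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_probs(toy_logits, temperature=1.0, top_k=2, top_p=1.0))
print(sample_token(toy_logits, random.Random(0), top_k=2))
```

Lower `temperature`, smaller `top_k` and smaller `top_p` all make the output more conservative by concentrating probability on the highest-scoring tokens.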
Configuration
The model and training parameters can be configured in config/config.yaml. Key configurations include:
model:
  vocab_size: 50000
  n_embd: 640
  n_layer: 12
  n_head: 10
  n_positions: 512
training:
  num_train_epochs: 3
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 0.0001
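The effective batch size follows from the per-device batch size and the number of accumulation steps. The excerpt above lists `per_device_train_batch_size: 4`, whereas the Training section lists 8 per GPU, so the small sketch below evaluates both. The helper name and the `num_gpus` parameter are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


print(effective_batch_size(4, 8))  # 32, using the config excerpt
print(effective_batch_size(8, 8))  # 64, using the Training section
```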
Performance
The model uses several optimizations for efficient training:
- DeepSpeed ZeRO Stage-2 for memory optimization
- FP16 mixed precision training
- Gradient accumulation for larger effective batch sizes
- Efficient data streaming for handling large datasets
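For orientation, the optimizations listed above map onto a DeepSpeed configuration roughly like the one below. This is a sketch, not the repository's actual config file, which is not shown here. The keys are standard DeepSpeed configuration options, the values are taken from the Training section, and weight decay is omitted because its value is not stated.

```python
import json

# Hypothetical DeepSpeed config assembled from the settings in this README.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # "8 per GPU" in the Training section
    "gradient_accumulation_steps": 8,      # larger effective batch size
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},             # FP16 mixed precision
    "zero_optimization": {"stage": 2},     # ZeRO Stage-2
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

print(json.dumps(ds_config, indent=2))
```

ZeRO Stage-2 partitions optimizer states and gradients across GPUs, so its memory savings apply mainly to multi-GPU runs.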
Outputs
Generating text from multiple prompts:
==================================================
Prompt: Once upon a time
Generated: Once upon a time time stood stoodDeliveryDelivery soot soototionotion caring caringHeroHerocomingcomingServingServingSKSKLocalLocalAlexanderAlexander Classroom Classroom advantage advantage am am released releasedGameGameTITI Lewis LewisASPASP Corn CornGeorgiaGeorgiaчч BT BT residents residentsemiemiلل develops develops methods methodsPerhapsPerhapsItemsItemsScaleScale viewers viewersexpressionexpression evil evil phase phase info info¶¶ could could website website originally originally YOU YOUHelloHello Sel Sel Lucknow LucknowPadPadratesrateslotslots elbow elbow resolved resolved afternoons afternoons API API Kart Kart Lights Lights rents rents Steering SteeringAzAzEnglishEnglishAlanAlan loves loves dermatology dermatology aur aur applauded applauded cemetery cemetery sooner sooner9090 tribes tribes orbit orbitannelannel boil boil related related excerpt excerpt inactivity inactivity allergies allergies provide provide enjoying enjoyingTransformingTransformingAlwaysAlwayslookinglooking noted noted unveil unveil distorted distorted Inter InterThankThankWasWasAroundAround vs vs Massage Massagepitpit Allowance Allowance Tact Tact Signing Signing Priest Priest Independence Independence protestors protestorsurdurd PM PM 2050 2050 expelled
--------------------------------------------------
Prompt: The meaning of life is
Generated: The meaning of life is is her herMexicoMexico angle angleeticeticitcheritcher Armed Armedocideocide Despite Despite group group frames frames gru gru called called Identity Identity covered coveredahah CBC CBCverevere knives knives perfect perfectMuMuServiceService majesty majesty Shaw Shaw said saidAllAll your your method method � � underlying underlying cities cities Workforce Workforce scratch scratch Ow Ow Lar Lariconicon discerning discerning Highlight Highlight invigorating invigoratingVisitingVisiting Recommended Recommended hub hub studies studies Mrs Mrs DN DN drawing drawing Started Started Structural Structural Kennedy Kennedymountedmounted liposuction liposuction Louis Louis hairc haircprisingprisingatchesatches copied copied Far Far coolers coolers invaded invaded UNHCR UNHCR attic attic mile mileonomicsonomicsobeobe Canary Canary Hezbollah HezbollahRegRegReducedReducedGuestGuestieliel build buildlatedlated Keyword Keyword Liberal LiberalArArActuallyActually Snacks Snacks vivid vivid Attack Attackmediamedia Dav Dav bull bullwegweg yeast yeastizableizableDRDRHLHLdiscdiscgargar Til Til Ensure Ensure has has always always Mustang Mustang warranted warranted Mean Mean Color Color blowing blowing
--------------------------------------------------
Prompt: In the distant future
Generated: In the distant future futureFlexibleFlexibleCivilCivil oste oste 1974 1974 Grain Grain Heating Heating id idaddadd square square ob ob Boosts BoostsHealthcareHealthcare Lamp Lamp woke woke Ger Ger tend tend rugby rugby breath breath Allow Allow COUN COUNwifewifenyny Guardians Guardians satisfied satisfied anonymous anonymous drier drier Introducing Introducing tips tips Creations Creationsphasisphasis procuring procuring Authenticity Authenticity Le LeocateocateBetterBetter open open Julia Julia HOW HOW eligibility eligibilityAffordableAffordableWhateverWhateverUnUn Trophy TrophyFullFull Reasons Reasons Pair Pairatureature Mosaic Mosaicnarnar firing firing mineral mineral Harvard Harvard Glob Globaluatingaluating penny pennyieberieber Chargers Chargers labeled labeled trafficking trafficking Continue Continue Rating Rating worlds worldshanahanaADAADA Wide WideIngredientsIngredientsomachomachertingertingaceyacey Depression Depressionricanrican Pierre Pierre limiting limiting scripted scriptedinxinx antenna antenna Stewardship Stewardship Toll Toll After After Resilience Resilience controversial controversial Quite Quite Unit Unit Before Before Nom Nomomiomi Allison AllisonownedownedProductsProducts Project Project trash trash testament testament plan plan slow slowWindWindandoando Term Term hesitate
--------------------------------------------------
Prompt: The best way to learn programming is
Generated: The best way to learn programming is is XS XS fee fee3939 trained trained Mail Mail Writing Writing techniques techniques diagrams diagrams Western Western onboarding onboarding Activ Activ Standing Standing provides provides Bar Bar Your YourFailureFailure Aging Aging Holmes Holmes regime regimegggg Rest Rest RTP RTP Glass Glass Statistical Statistical Click Click delivered delivered eg eg markers markersaucomaaucomaspectspect lent lentitaitaFromFromPriorPriorursesurses codes codes ph ph gener gener quantity quantity Lavender Lavenderstorestore billing billing organized organizedfiresfiresVegasVegas typical typical SAS SAS�� if if photos photos regularly regularly Compliance Compliance percent percentiasias 2020 2020 panic panicholmholm Although Although canada canada Categories Categories Bites BitesOurOur World World Ti Ti cynical cynical Financial Financial review review own own into into Teresa TeresaSweSweGRGR gemstone gemstonetexttext bourbon bourbon committed committed films filmsCleCleoesoesiffiff dining diningiguousiguous final finalashash media mediaJamJamsubssubsCANCAN elegant elegantxxrainingraining Obviously Obviously cholesterol cholesterolleggedlegged favors favors 5 5
--------------------------------------------------
Prompt: Today I learned that
Generated: Today I learned that that taking taking Thursday Thursday Workforce Workforce slopes slopes depictions depictions Dean Deanessess added addedashesashesSimpleSimple supporting supportingEggEggCustomCustom apologies apologies closures closures not not through through not Fred Fred THC THCilsils mutual mutualodyody cork corkSustainabilitySustainability Accessories Accessories East East Vit Vit 3 3 Wii Wii LMS LMS Make Make Auto Auto roasting roasting costs costs precarious precarious Maya Maya worked worked Therefore Thereforeirectionirectionroadroad2020 Al AlInvestmentInvestment Notification Notification€€ThusThus Versatility Versatilityhusihusi Printer Printer organisations organisations Wright Wrightocalypseocalypse Generic Generic Chow Chow compact compactee Painting Painting fifty fiftyFromFromTurnTurn maintenance maintenance Will WillMyMyPsychPsych anim anim Admissions Admissions Betting Betting inaugurated inaugurated preceded preceded study study successful successful complementing complementingisiisiindersinders evacu evacu finely finely extent extent ru ru Min MinIndustryIndustry western westernkarkar salads salads USB USB Jay Jay NPR NPRcorrectcorrectogenesisogenesis Help Help staunch staunch magnificent magnificentQCQCABABgotgotCaringCaringGoogleGoogle nonetheless nonetheless
--------------------------------------------------