# FischGPT-SFT

## Model Description

FischGPT-SFT is a supervised fine-tuned GPT-2 style transformer model built completely from scratch in PyTorch. The implementation demonstrates the full transformer architecture and industry-standard training practices end to end.
## Key Features

- From-scratch implementation: Every component built without pre-existing transformer libraries
- Flash Attention: Efficient attention via `F.scaled_dot_product_attention`
- Professional Architecture: Clean separation of attention, MLP, and transformer blocks
- Industry Training: Follows OpenAI's GPT-2 training methodology
- Production Ready: Includes proper weight initialization and distributed training support
## Model Architecture

| Parameter | Value |
|---|---|
| Model Type | GPT-2 style decoder |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Parameters | ~124M |
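
For reference, a minimal configuration sketch matching the table above. The class and field names (`GPTConfig`, `n_layer`, `n_embd`, ...) are illustrative and may differ from the repository's actual identifiers:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """Hypothetical config mirroring the architecture table."""
    block_size: int = 1024   # context length
    vocab_size: int = 50304  # GPT-2's 50,257 BPE tokens, likely padded up for efficiency
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # hidden size
```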
## Training Details

- Model Type: Supervised fine-tuned (SFT)
- Training Data: OpenAssistant/oasst1
- Training Steps: 19,999
- Final Validation Loss: ~1.726
- Tokenizer: GPT-2 BPE (tiktoken)
- Framework: PyTorch with mixed precision (bfloat16)
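
Because the model uses GPT-2's BPE vocabulary, prompts can be tokenized with `tiktoken` directly. A minimal sketch (the `"gpt2"` encoding name is tiktoken's standard identifier, not something specific to this repository):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2 BPE vocabulary
tokens = enc.encode("Explain quantum computing in simple terms")
print(len(tokens), tokens[:8])        # number of tokens and the first few ids
assert enc.decode(tokens) == "Explain quantum computing in simple terms"
```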
## Training Infrastructure

- Distributed Training: Multi-GPU support with DistributedDataParallel
- Optimization: AdamW with a cosine learning rate schedule (sketched below)
- Regularization: Weight decay, dropout, gradient clipping
- Monitoring: Comprehensive logging and checkpoint management
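
A condensed sketch of how these pieces typically fit together in a GPT-2 style training loop. The hyperparameter names and values (`max_lr`, `warmup_steps`, ...) are illustrative assumptions rather than values from this repository, and `model` / `loader` are assumed to exist:

```python
import math
import torch

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=500, max_steps=19999):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# AdamW with decoupled weight decay (betas/decay are typical GPT-2 style choices)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

for step, (x, y) in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # mixed precision
        logits, loss = model(x, y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    lr = get_lr(step)
    for group in optimizer.param_groups:                     # cosine LR schedule
        group["lr"] = lr
    optimizer.step()
```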
## Usage

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained('fischgpt-sft')
tokenizer = GPT2Tokenizer.from_pretrained('fischgpt-sft')

# Generate text
input_text = "The future of artificial intelligence"
inputs = tokenizer.encode(input_text, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_length=100,
        num_return_sequences=1,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Chat Format (for SFT models)

```python
def chat_format(user_message):
    """Wrap a user message in the prompt markers the SFT model was trained on."""
    return f"<|user|>{user_message}<|assistant|>"

prompt = chat_format("Explain quantum computing in simple terms")
# ... generate as above
```
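
Continuing from the snippets above, a hedged end-to-end example that generates and then strips the prompt so only the assistant's reply remains (generation settings are illustrative):

```python
inputs = tokenizer.encode(prompt, return_tensors='pt')
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_length=256,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
# Drop the prompt tokens; keep only the newly generated reply
reply = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(reply)
```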
## Implementation Highlights

### Custom Components

- CasualSelfAttention: Multi-head self-attention with causal masking
- MLP: Feed-forward network with GELU activation and custom initialization
- Block: Transformer block with pre-layer normalization (see the sketch below)
- GPT: Complete model with tied embeddings and generation capabilities
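
A minimal sketch of what a pre-layer-norm block built from the components above typically looks like. The structure is standard GPT-2; attribute names are assumptions and may differ from the repository:

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN transformer block: residual attention, then residual MLP."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CasualSelfAttention(config)  # causal multi-head attention
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)                   # GELU feed-forward network

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # normalize before attention, add residual
        x = x + self.mlp(self.ln_2(x))   # normalize before MLP, add residual
        return x
```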
### Advanced Features

```python
# Flash Attention via PyTorch's fused kernel
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Custom weight initialization: scale down flagged residual projections
if hasattr(module, "FISCHGPT_SCALE_INIT"):
    std *= (2 * self.config.n_layer) ** -0.5
```
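
For context, a hedged sketch of how such scaled initialization is usually wired up: residual-path projection layers carry the flag, and the model's weight initializer shrinks their std by 1/sqrt(2 * n_layer) so residual activations don't grow with depth. The flag name matches the snippet above; the surrounding method is an assumption:

```python
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # ... build embeddings, blocks, and lm_head here, flagging residual
        # projections with e.g. `self.c_proj.FISCHGPT_SCALE_INIT = 1`, then:
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02  # GPT-2 default init std
            if hasattr(module, "FISCHGPT_SCALE_INIT"):
                # two residual additions per block -> scale by 1/sqrt(2 * n_layer)
                std *= (2 * self.config.n_layer) ** -0.5
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
```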
## Performance & Benchmarks

| Metric | Value |
|---|---|
| Training Speed | ~1.2M tokens/sec |
| Memory Efficiency | Mixed precision (bfloat16) |
| Context Length | 1024 tokens |
| Generation Speed | Fast inference with optimized attention |
## Technical Specifications

- Attention Pattern: Causal (autoregressive)
- Activation Function: GELU (approximate='tanh')
- Normalization: Layer Normalization
- Position Encoding: Learned positional embeddings
- Weight Tying: Shared input/output embeddings (see the sketch below)
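
A brief sketch of how learned positional embeddings and weight tying fit together in a GPT-2 style model. The module names (`wte`, `wpe`, `lm_head`) follow GPT-2 convention and are assumptions about this repository:

```python
import torch
import torch.nn as nn

class GPTSkeleton(nn.Module):
    """Illustrative skeleton: token + learned position embeddings, tied lm_head."""
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # learned positions
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                      # weight tying

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)  # (B, T, n_embd); transformer blocks omitted
        return self.lm_head(x)             # logits over the vocabulary
```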
## Use Cases

- Conversational AI and instruction following
- Code completion and programming assistance
- Creative writing and storytelling
- Educational content generation
- Research and experimentation

## Limitations

- Context length limited to 1024 tokens
- English-focused training data
- Requires careful prompt engineering for best results
- May generate inconsistent or incorrect information
## Ethics and Safety

This model was trained on publicly available datasets and may reflect biases present in the training data. Users should:
- Validate generated content for accuracy
- Be aware of potential biases in outputs
- Use appropriate content filtering for production applications
- Follow responsible AI practices
## Citation

```bibtex
@misc{fischgpt2024,
  title={FischGPT: A From-Scratch GPT-2 Implementation},
  author={[Your Name]},
  year={2024},
  howpublished={\url{https://github.com/yourusername/FischGPT}}
}
```
## License

MIT License - See LICENSE file for details.
Built with industry best practices and attention to detail. This implementation showcases deep understanding of transformer architecture and modern NLP engineering.