FischGPT-SFT

Model Description

FischGPT-SFT is a supervised fine-tuned GPT-2 style transformer model built completely from scratch using PyTorch. This implementation demonstrates deep understanding of transformer architecture and industry-standard training practices.

Key Features

  • From-scratch implementation: Every component built without using pre-existing transformer libraries
  • Flash Attention: Implements efficient attention using F.scaled_dot_product_attention
  • Professional Architecture: Clean separation of attention, MLP, and transformer blocks
  • Industry Training: Follows OpenAI's GPT-2 training methodology
  • Production Ready: Includes proper weight initialization and distributed training support

Model Architecture

Parameter          Value
Model Type         GPT-2 Style Decoder
Layers             12
Hidden Size        768
Attention Heads    12
Context Length     1024
Vocabulary Size    50,304
Parameters         ~124M
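
For illustration, these hyperparameters map naturally onto a small config object. The field names below are assumptions for the sketch, not necessarily the ones used in the FischGPT source:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # context length
    vocab_size: int = 50304   # GPT-2's 50,257 tokens padded up to a GPU-friendly multiple
    n_layer: int = 12         # transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # hidden size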

Training Details

  • Model Type: Supervised Fine-Tuned
  • Training Data: OpenAssistant/oasst1
  • Training Steps: 19,999
  • Final Validation Loss: ≈1.7258
  • Tokenizer: GPT-2 BPE via tiktoken (see the sketch after this list)
  • Framework: PyTorch with mixed precision (bfloat16)
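
Since the tokenizer is the stock GPT-2 BPE exposed through tiktoken, encoding a chat-formatted prompt is plain text encoding. The snippet below is a minimal sketch, assuming the <|user|>/<|assistant|> markers are treated as ordinary text rather than added special tokens:

import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2 BPE vocabulary
prompt = "<|user|>Explain quantum computing in simple terms<|assistant|>"
tokens = enc.encode(prompt)                  # list of token ids
eot_id = enc.eot_token                       # id of <|endoftext|>, GPT-2's only special token
assert enc.decode(tokens) == prompt          # BPE round-trips the prompt exactly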

Training Infrastructure

  • Distributed Training: Multi-GPU support with DistributedDataParallel
  • Optimization: AdamW with a cosine learning-rate schedule (sketched after this list)
  • Regularization: Weight decay, dropout, gradient clipping
  • Monitoring: Comprehensive logging and checkpoint management
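
A minimal sketch of this setup (AdamW, linear warmup into cosine decay, and gradient clipping) is shown below. The 19,999-step horizon comes from the training details above; every other hyperparameter here is an illustrative placeholder, not a value taken from the actual run:

import math
import torch
import torch.nn as nn

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=500, max_steps=19_999):
    # Linear warmup followed by cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)

model = nn.Linear(8, 8)                      # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

for step in range(3):                        # stand-in training loop
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)                             # cosine schedule
    optimizer.step()
    optimizer.zero_grad()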

Usage

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained('kristianfischerai12345/fischgpt-sft')
tokenizer = GPT2Tokenizer.from_pretrained('kristianfischerai12345/fischgpt-sft')

# Generate text
input_text = "The future of artificial intelligence"
inputs = tokenizer.encode(input_text, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_length=100,
        num_return_sequences=1,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Chat Format (for SFT models)

def chat_format(user_message):
    return f"<|user|>{user_message}<|assistant|>"

prompt = chat_format("Explain quantum computing in simple terms")
# ... generate as above
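
Reusing model, tokenizer, and torch from the Usage section above, the chat prompt goes through the same generation call; the split on the assistant marker is an illustrative way to strip the prompt from the decoded output:

inputs = tokenizer.encode(prompt, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_length=200,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

reply = tokenizer.decode(outputs[0], skip_special_tokens=True)
reply = reply.split("<|assistant|>")[-1].strip()   # keep only the assistant's answer
print(reply)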

Implementation Highlights

Custom Components

  • CausalSelfAttention: Implements multi-head self-attention with causal masking
  • MLP: Feed-forward network with GELU activation and custom initialization
  • Block: Transformer block with pre-layer normalization (see the sketch after this list)
  • GPT: Complete model with tied embeddings and generation capabilities
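
As a reference point, a pre-layer-norm transformer block in a from-scratch GPT-2 implementation typically looks like the sketch below. The class names mirror the list above; the internals are an illustrative reconstruction, not the exact FischGPT code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # fused query/key/value projection
        self.c_proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # flash attention path
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, n_embd=768):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.gelu = nn.GELU(approximate="tanh")       # GELU, tanh approximation
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # pre-LN residual around attention
        x = x + self.mlp(self.ln_2(x))    # pre-LN residual around the MLP
        return x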

Advanced Features

# Flash Attention Implementation
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Custom Weight Initialization
if hasattr(module, "FISCHGPT_SCALE_INIT"):
    std *= (2 * self.config.n_layer) ** -.5
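
For context, a scaling hook like this is usually checked inside the model's weight-initialization pass. Scaling residual projections by 1/sqrt(2 * n_layer) keeps the residual stream's variance roughly constant with depth, since each of the n_layer blocks adds two residual contributions. The function below is an illustrative reconstruction (assuming torch.nn as nn); only the FISCHGPT_SCALE_INIT branch is taken from the snippet above:

def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, "FISCHGPT_SCALE_INIT"):
            std *= (2 * self.config.n_layer) ** -0.5   # scale down residual projections
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)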

Performance & Benchmarks

Metric               Value
Training Speed       ~1,200,000 tokens/sec
Memory Efficiency    Mixed precision (bfloat16)
Context Length       1024 tokens
Generation Speed     Fast inference via F.scaled_dot_product_attention
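
For reference, tokens/sec is the usual throughput measure: tokens processed per optimizer step divided by wall-clock step time. A quick sanity check with purely hypothetical batch settings:

micro_batch = 16        # sequences per forward pass (hypothetical)
seq_len = 1024          # context length from the table above
grad_accum = 32         # gradient accumulation steps (hypothetical)
step_time_s = 0.45      # measured seconds per optimizer step (hypothetical)

tokens_per_step = micro_batch * seq_len * grad_accum        # 524,288 tokens
print(f"{tokens_per_step / step_time_s:,.0f} tokens/sec")   # ~1,165,000 with these placeholders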

Technical Specifications

  • Attention Pattern: Causal (autoregressive)
  • Activation Function: GELU (approximate='tanh')
  • Normalization: Layer Normalization
  • Position Encoding: Learned positional embeddings
  • Weight Tying: Shared input/output embedding matrix (see the sketch after this list)
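
A short sketch of how the last two items (learned positional embeddings and weight tying) are typically wired in a GPT-2 style module; the attribute names follow GPT-2 conventions and are illustrative:

import torch
import torch.nn as nn

class EmbeddingSkeleton(nn.Module):
    # Embedding/head wiring only; the transformer blocks are omitted.
    def __init__(self, vocab_size=50304, block_size=1024, n_embd=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)     # token embeddings
        self.wpe = nn.Embedding(block_size, n_embd)     # learned positional embeddings
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight           # weight tying: shared input/output matrix

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.wte(idx) + self.wpe(pos)               # token + position embeddings
        return self.lm_head(x)                          # logits over the vocabulary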

Use Cases

  • Conversational AI and instruction following
  • Code completion and programming assistance
  • Creative writing and storytelling
  • Educational content generation
  • Research and experimentation

Limitations

  • Context length limited to 1024 tokens
  • English-focused training data
  • Requires careful prompt engineering for best results
  • May generate inconsistent or incorrect information

Ethics and Safety

This model was trained on publicly available datasets and may reflect biases present in the training data. Users should:

  • Validate generated content for accuracy
  • Be aware of potential biases in outputs
  • Use appropriate content filtering for production applications
  • Follow responsible AI practices

Citation

@misc{fischgpt2024,
  title={FischGPT: A From-Scratch GPT-2 Implementation},
  author={[Your Name]},
  year={2024},
  howpublished={\url{https://github.com/yourusername/FischGPT}}
}

License

MIT License - See LICENSE file for details.


Built with industry best practices and attention to detail. This implementation showcases deep understanding of transformer architecture and modern NLP engineering.
