SchorbGPT-Medium

This is a medium-sized language model trained on web data. It uses the GPT-2 architecture and tokenizer.

Model Details

  • Model Type: GPT-2
  • Training Data: Web text data
  • Number of Parameters: GPT-2 medium scale
  • Context Length: 512 tokens (see the configuration check below)
  • Training Framework: PyTorch
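
The card does not state an exact parameter count, so the details above can be checked directly from the checkpoint. A minimal sketch, assuming the model loads with the same transformers APIs used in the Usage section below:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("iimaginary/schorbGPT-medium")
print(config.model_type)    # expected: "gpt2"
print(config.n_positions)   # expected: 512 (context length)

model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")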

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the GPT-2 tokenizer and model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

# Encode a prompt and generate a continuation
text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
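
The call above uses greedy decoding by default. For more varied output, sampling can be enabled; the settings below are illustrative assumptions, not values recommended by this card:

# Sampling-based generation (illustrative settings, not card-recommended values)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=100,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))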

Performance and Model Analysis

Zero-shot Evaluation Results

| Task       | Metric                | Value   | Stderr  |
|------------|-----------------------|---------|---------|
| WikiText   | bits_per_byte         | 0.9860  | N/A     |
| WikiText   | byte_perplexity       | 1.9806  | N/A     |
| WikiText   | word_perplexity       | 38.6497 | N/A     |
| ARC Easy   | accuracy              | 48.02%  | ±1.03%  |
| ARC Easy   | accuracy (normalized) | 42.17%  | ±1.01%  |
| HellaSwag  | accuracy              | 29.06%  | ±0.45%  |
| HellaSwag  | accuracy (normalized) | 31.26%  | ±0.46%  |
| LAMBADA    | accuracy              | 33.90%  | ±0.66%  |
| LAMBADA    | perplexity            | 36.2055 | ±1.4052 |
| PIQA       | accuracy              | 61.92%  | ±1.13%  |
| PIQA       | accuracy (normalized) | 62.46%  | ±1.13%  |
| Winogrande | accuracy              | 50.59%  | ±1.41%  |
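
The three WikiText rows are different normalizations of the same measured log-likelihood (assuming the standard definitions used by common evaluation harnesses): bits per byte is the base-2 log of byte perplexity, and word perplexity re-expresses the same quantity per word rather than per byte. A quick arithmetic sanity check:

import math

bits_per_byte = 0.9860
byte_perplexity = 2 ** bits_per_byte          # ~1.98, consistent with the table
word_perplexity = 38.6497

# Implied average bytes per word on the evaluation text
bytes_per_word = math.log(word_perplexity) / math.log(byte_perplexity)
print(round(byte_perplexity, 4), round(bytes_per_word, 2))  # ~1.9807, ~5.35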

Analysis and Comparisons

Language Modeling Performance

The model achieves a word perplexity of 38.65 on WikiText, which is competitive with similar-sized models (a rough reproduction sketch follows the comparison list below). For comparison:

  • Original GPT-2 (small): ~35-40 perplexity
  • GPT-2 medium: ~30-35 perplexity
  • BERT-base: ~40-45 perplexity
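
The word perplexity above is a corpus-level number computed over all of WikiText. As a rough illustration of the underlying computation (not the exact harness procedure, which aggregates log-likelihood over the full corpus and normalizes by words or bytes), token-level perplexity for a piece of text can be estimated from the model's average cross-entropy loss:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average cross-entropy loss
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"token-level perplexity: {torch.exp(loss).item():.2f}")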

Task-Specific Analysis:

  1. Physical and Commonsense Reasoning:

    • PIQA: 61.92% (Random baseline: 50%)
    • Comparable to GPT-2 small/medium performance
    • Shows good physical commonsense understanding
  2. Science Knowledge:

    • ARC Easy: 48.02% (Random baseline: 25%)
    • Well above random chance, demonstrating basic scientific knowledge
    • Similar to performance seen in early GPT-2 variants
  3. Linguistic Understanding:

    • LAMBADA: 33.90% accuracy with perplexity of 36.21
    • HellaSwag: 31.26% (Random baseline: 25%)
    • Performance indicates basic linguistic and contextual understanding
    • Typical range for non-fine-tuned models of this scale
  4. Reasoning and Logic:

    • Winogrande: 50.59% (Random baseline: 50%)
    • On par with random chance, suggesting room for improvement on complex reasoning tasks (see the multiple-choice scoring sketch after this list)
    • Common for base models without specific fine-tuning
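
The accuracy numbers above come from zero-shot multiple-choice scoring: each candidate completion is scored by the log-likelihood the model assigns to it, and the highest-scoring option is taken as the prediction; the "normalized" variants divide that score by completion length so longer options are not penalized. A minimal sketch of the idea, using a made-up PIQA-style item (the question, options, and helper function are illustrative, not from the dataset or any specific harness, which also handle tokenization boundaries more carefully):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
model.eval()

def completion_logprob(context: str, completion: str) -> float:
    # Sum of log-probabilities the model assigns to `completion` given `context`
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the completion tokens, each predicted from the previous position
    positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, full_ids[0, pos + 1]].item() for pos in positions)

# Illustrative PIQA-style item (not an actual dataset example)
goal = "To keep ice cream from melting on a hot day,"
options = [" store it in an insulated cooler.", " leave it on the kitchen counter."]

scores = [completion_logprob(goal, opt) for opt in options]
norm_scores = [s / len(opt) for s, opt in zip(scores, options)]  # acc_norm-style length normalization
print("prediction:", options[scores.index(max(scores))])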

Strengths and Limitations

Strengths:

  • Strong performance on physical commonsense (PIQA)
  • Decent basic science knowledge (ARC Easy)
  • Competitive language modeling metrics

Limitations:

  • Limited complex reasoning capabilities (Winogrande)
  • Basic linguistic understanding could be improved (LAMBADA, HellaSwag)
  • Performance typical of base models without task-specific fine-tuning

Limitations

This is a base model without fine-tuning or alignment. It should be used with appropriate consideration of its capabilities and limitations.
