SchorbGPT-Medium

This is a medium-sized language model trained on web data. It uses the GPT-2 architecture and tokenizer.

Model Details

  • Model Type: GPT-2
  • Training Data: Web text data
  • Number of Parameters: GPT-2 medium scale
  • Context Length: 512 tokens (see the configuration check below)
  • Training Framework: PyTorch
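
The card does not state an exact parameter count, so the details above can be checked directly from the checkpoint. A minimal sketch, assuming the model loads with the same transformers APIs used in the Usage section below:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("iimaginary/schorbGPT-medium")
print(config.model_type)    # expected: "gpt2"
print(config.n_positions)   # expected: 512 (context length)

model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")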

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the GPT-2 tokenizer and model weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

# Encode a prompt and generate a continuation
text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
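
The call above uses greedy decoding by default. For more varied output, sampling can be enabled; the settings below are illustrative assumptions, not values recommended by this card:

# Sampling-based generation (illustrative settings, not card-recommended values)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=100,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))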

Performance and Model Analysis

Zero-shot Evaluation Results

| Task       | Metric                | Value   | Stderr  |
|------------|-----------------------|---------|---------|
| WikiText   | bits_per_byte         | 0.9860  | N/A     |
| WikiText   | byte_perplexity       | 1.9806  | N/A     |
| WikiText   | word_perplexity       | 38.6497 | N/A     |
| ARC Easy   | accuracy              | 48.02%  | ±1.03%  |
| ARC Easy   | accuracy (normalized) | 42.17%  | ±1.01%  |
| HellaSwag  | accuracy              | 29.06%  | ±0.45%  |
| HellaSwag  | accuracy (normalized) | 31.26%  | ±0.46%  |
| LAMBADA    | accuracy              | 33.90%  | ±0.66%  |
| LAMBADA    | perplexity            | 36.2055 | ±1.4052 |
| PIQA       | accuracy              | 61.92%  | ±1.13%  |
| PIQA       | accuracy (normalized) | 62.46%  | ±1.13%  |
| Winogrande | accuracy              | 50.59%  | ±1.41%  |
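
The three WikiText rows are different normalizations of the same measured log-likelihood (assuming the standard definitions used by common evaluation harnesses): bits per byte is the base-2 log of byte perplexity, and word perplexity re-expresses the same quantity per word rather than per byte. A quick arithmetic sanity check:

import math

bits_per_byte = 0.9860
byte_perplexity = 2 ** bits_per_byte          # ~1.98, consistent with the table
word_perplexity = 38.6497

# Implied average bytes per word on the evaluation text
bytes_per_word = math.log(word_perplexity) / math.log(byte_perplexity)
print(round(byte_perplexity, 4), round(bytes_per_word, 2))  # ~1.9807, ~5.35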

Analysis and Comparisons

Language Modeling Performance

The model achieves a word perplexity of 38.65 on WikiText, which is competitive with similar-sized models (a rough reproduction sketch follows the comparison list below). For comparison:

  • Original GPT-2 (small): ~35-40 perplexity
  • GPT-2 medium: ~30-35 perplexity
  • BERT-base: ~40-45 perplexity
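
The word perplexity above is a corpus-level number computed over all of WikiText. As a rough illustration of the underlying computation (not the exact harness procedure, which aggregates log-likelihood over the full corpus and normalizes by words or bytes), token-level perplexity for a piece of text can be estimated from the model's average cross-entropy loss:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average cross-entropy loss
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"token-level perplexity: {torch.exp(loss).item():.2f}")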

Task-Specific Analysis:

  1. Physical and Commonsense Reasoning:

    • PIQA: 61.92% (Random baseline: 50%)
    • Comparable to GPT-2 small/medium performance
    • Shows good physical commonsense understanding
  2. Science Knowledge:

    • ARC Easy: 48.02% (Random baseline: 25%)
    • Well above random chance, demonstrating basic scientific knowledge
    • Similar to performance seen in early GPT-2 variants
  3. Linguistic Understanding:

    • LAMBADA: 33.90% accuracy with perplexity of 36.21
    • HellaSwag: 31.26% (Random baseline: 25%)
    • Performance indicates basic linguistic and contextual understanding
    • Typical range for non-fine-tuned models of this scale
  4. Reasoning and Logic:

    • Winogrande: 50.59% (Random baseline: 50%)
    • On par with random chance, suggesting room for improvement on complex reasoning tasks (see the multiple-choice scoring sketch after this list)
    • Common for base models without specific fine-tuning
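
The accuracy numbers above come from zero-shot multiple-choice scoring: each candidate completion is scored by the log-likelihood the model assigns to it, and the highest-scoring option is taken as the prediction; the "normalized" variants divide that score by completion length so longer options are not penalized. A minimal sketch of the idea, using a made-up PIQA-style item (the question, options, and helper function are illustrative, not from the dataset or any specific harness, which also handle tokenization boundaries more carefully):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
model.eval()

def completion_logprob(context: str, completion: str) -> float:
    # Sum of log-probabilities the model assigns to `completion` given `context`
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the completion tokens, each predicted from the previous position
    positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, full_ids[0, pos + 1]].item() for pos in positions)

# Illustrative PIQA-style item (not an actual dataset example)
goal = "To keep ice cream from melting on a hot day,"
options = [" store it in an insulated cooler.", " leave it on the kitchen counter."]

scores = [completion_logprob(goal, opt) for opt in options]
norm_scores = [s / len(opt) for s, opt in zip(scores, options)]  # acc_norm-style length normalization
print("prediction:", options[scores.index(max(scores))])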

Strengths and Limitations

Strengths:

  • Strong performance on physical commonsense (PIQA)
  • Decent basic science knowledge (ARC Easy)
  • Competitive language modeling metrics

Limitations:

  • Limited complex reasoning capabilities (Winogrande)
  • Basic linguistic understanding could be improved (LAMBADA, HellaSwag)
  • Performance typical of base models without task-specific fine-tuning

Limitations

This is a base model without fine-tuning or alignment. It should be used with appropriate consideration of its capabilities and limitations.
