# SchorbGPT-Medium
This is a medium-sized language model trained on web data. The model uses the GPT-2 architecture and tokenizer.
## Model Details
- Model Type: GPT-2
- Training Data: Web text data
- Number of Parameters: GPT-2 medium scale
- Context Length: 512 tokens
- Training Framework: PyTorch
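The details above can be sanity-checked against the config published on the Hub; a minimal sketch (the printed values are expectations based on the list above, not verified output):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("iimaginary/schorbGPT-medium")
# GPT-2 configs expose the context window as n_positions
print(config.n_positions)                            # expected: 512
print(config.n_layer, config.n_head, config.n_embd)  # GPT-2 medium scale: 24, 16, 1024
```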
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

# Tokenize a prompt and generate a continuation
text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
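Note that `max_length` counts the prompt tokens as well as the generated ones. For sampling-based generation with a budget on new tokens only, something like the following works (the decoding parameters are illustrative defaults, not values tuned for this model):

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,                   # budget for generated tokens only
    do_sample=True,                       # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```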
## Performance and Model Analysis
### Zero-shot Evaluation Results
| Task | Metric | Value | Stderr |
|---|---|---|---|
| WikiText | bits_per_byte | 0.9860 | N/A |
| WikiText | byte_perplexity | 1.9806 | N/A |
| WikiText | word_perplexity | 38.6497 | N/A |
| ARC Easy | accuracy | 48.02% | ±1.03% |
| ARC Easy | accuracy (normalized) | 42.17% | ±1.01% |
| HellaSwag | accuracy | 29.06% | ±0.45% |
| HellaSwag | accuracy (normalized) | 31.26% | ±0.46% |
| LAMBADA | accuracy | 33.90% | ±0.66% |
| LAMBADA | perplexity | 36.2055 | ±1.4052 |
| PIQA | accuracy | 61.92% | ±1.13% |
| PIQA | accuracy (normalized) | 62.46% | ±1.13% |
| Winogrande | accuracy | 50.59% | ±1.41% |
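The metric names above (`bits_per_byte`, `byte_perplexity`, paired raw/normalized accuracy) match the output format of EleutherAI's lm-evaluation-harness. Assuming that harness (v0.4+) produced these numbers, a reproduction could look roughly like the sketch below; the exact task list, including `lambada_openai` for LAMBADA, is an assumption:

```python
import lm_eval  # pip install lm-eval

# Run the same zero-shot task suite against this checkpoint
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=iimaginary/schorbGPT-medium",
    tasks=["wikitext", "arc_easy", "hellaswag", "lambada_openai", "piqa", "winogrande"],
)
print(results["results"])
```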
### Analysis and Comparisons
#### Language Modeling Performance
The model achieves a word perplexity of 38.65 on WikiText, which is competitive with similar-sized models; a sketch of how perplexity is measured follows the comparison list below. For comparison:
- Original GPT-2 (small): ~35-40 perplexity
- GPT-2 medium: ~30-35 perplexity
- BERT-base: ~40-45 perplexity
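Word perplexity is the exponential of the average negative log-likelihood per word. A minimal sketch of measuring perplexity with this model (this computes token-level perplexity over GPT-2 subwords, which comes out lower than the word-level figure reported above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over shifted next-token predictions
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Token-level perplexity: {torch.exp(loss).item():.2f}")
```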
#### Task-Specific Analysis
**Physical and Commonsense Reasoning:**
- PIQA: 61.92% (Random baseline: 50%)
- Comparable to GPT-2 small/medium performance
- Shows good physical commonsense understanding
**Science Knowledge:**
- ARC Easy: 48.02% (Random baseline: 25%)
- Well above random chance, demonstrating basic scientific knowledge
- Similar to the performance of early GPT-2 variants
**Linguistic Understanding:**
- LAMBADA: 33.90% accuracy with perplexity of 36.21
- HellaSwag: 31.26% (Random baseline: 25%)
- Performance indicates basic linguistic and contextual understanding
- Typical range for non-fine-tuned models of this scale
**Reasoning and Logic:**
- Winogrande: 50.59% (Random baseline: 50%)
- On par with random chance, suggesting room for improvement on complex reasoning tasks
- Common for base models without task-specific fine-tuning
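For context on the paired accuracy / accuracy (normalized) numbers above: zero-shot multiple-choice tasks are typically scored by computing the log-likelihood the model assigns to each answer choice given the question and picking the highest-scoring one; the normalized variant divides each score by the choice's length so longer answers are not penalized (the harness normalizes by byte length; token count is used here for simplicity). A sketch with a hypothetical PIQA-style item:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
model.eval()

def choice_score(context: str, choice: str) -> tuple[float, int]:
    """Sum of log-probabilities assigned to the choice tokens, plus their count."""
    n_ctx = tokenizer(context, return_tensors="pt")["input_ids"].shape[1]
    ids = tokenizer(context + choice, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    # The logits at position p - 1 predict the token at position p
    total = sum(log_probs[0, p - 1, ids[0, p]].item() for p in range(n_ctx, ids.shape[1]))
    return total, ids.shape[1] - n_ctx

# Hypothetical example, not an actual PIQA item
context = "Question: How do you open a jar?\nAnswer:"
choices = [" Twist the lid counterclockwise.", " Throw the jar at a wall."]
scores = [choice_score(context, c) for c in choices]

pred = max(range(len(choices)), key=lambda i: scores[i][0])                      # accuracy
pred_norm = max(range(len(choices)), key=lambda i: scores[i][0] / scores[i][1])  # normalized
print(choices[pred].strip(), "|", choices[pred_norm].strip())
```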
### Strengths and Limitations
**Strengths:**
- Strong performance on physical commonsense (PIQA)
- Decent basic science knowledge (ARC Easy)
- Competitive language modeling metrics
**Limitations:**
- Limited complex reasoning capabilities (Winogrande)
- Basic linguistic understanding could be improved (LAMBADA, HellaSwag)
- Performance typical of base models without task-specific fine-tuning
## Limitations
This is a base model without fine-tuning or alignment. It should be used with appropriate consideration of its capabilities and limitations.