|
---
language:
- en
tags:
- gpt2
- text-generation
- pytorch
license: mit
---
|
|
|
# SchorbGPT-Medium |
|
|
|
SchorbGPT-Medium is a medium-sized language model trained on web data. It uses the GPT-2 architecture and tokenizer.
|
|
|
## Model Details |
|
|
|
- Model Type: GPT-2 |
|
- Training Data: Web text data |
|
- Number of Parameters: GPT-2 medium scale (roughly 355M)
|
- Context Length: 512 tokens |
|
- Training Framework: PyTorch |
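
These details can be checked directly against the checkpoint. The snippet below is a minimal sanity check and assumes the repository loads with a standard GPT-2 configuration, which exposes the context length as `n_positions`:

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

# GPT-2 style configs expose the maximum context window as n_positions
print("context length:", config.n_positions)
# Total parameter count, in millions
print("parameters (M):", round(sum(p.numel() for p in model.parameters()) / 1e6, 1))
```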
|
|
|
## Usage |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

# Encode a prompt and generate a continuation (max_length counts the prompt tokens)
text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
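
By default `generate` decodes greedily, which tends to produce repetitive text from a base model. Continuing from the snippet above, the sketch below switches to sampling; the specific values (`max_new_tokens=64`, `top_p=0.9`, `temperature=0.8`) are illustrative choices, not settings recommended by the model authors:

```python
# Sampling-based generation (illustrative settings)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,                    # number of new tokens after the prompt
    do_sample=True,                       # sample instead of greedy decoding
    top_p=0.9,                            # nucleus sampling
    temperature=0.8,                      # soften the next-token distribution
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; silences a warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```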
|
|
|
## Performance and Model Analysis |
|
|
|
### Zero-shot Evaluation Results |
|
|
|
| Task | Metric | Value | Stderr |
|------|--------|-------|--------|
| WikiText | bits_per_byte | 0.9860 | N/A |
| WikiText | byte_perplexity | 1.9806 | N/A |
| WikiText | word_perplexity | 38.6497 | N/A |
| ARC Easy | accuracy | 48.02% | ±1.03% |
| ARC Easy | accuracy (normalized) | 42.17% | ±1.01% |
| HellaSwag | accuracy | 29.06% | ±0.45% |
| HellaSwag | accuracy (normalized) | 31.26% | ±0.46% |
| LAMBADA | accuracy | 33.90% | ±0.66% |
| LAMBADA | perplexity | 36.2055 | ±1.4052 |
| PIQA | accuracy | 61.92% | ±1.13% |
| PIQA | accuracy (normalized) | 62.46% | ±1.13% |
| Winogrande | accuracy | 50.59% | ±1.41% |
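
The three WikiText rows are different views of the same measurement: `bits_per_byte` is the base-2 logarithm of `byte_perplexity`, and `word_perplexity` renormalizes the total log-likelihood by word count rather than byte count. The metric names follow the conventions of common evaluation harnesses such as EleutherAI's lm-evaluation-harness, though the exact evaluation setup is not documented here. A quick check of the first relation:

```python
import math

byte_perplexity = 1.9806
bits_per_byte = math.log2(byte_perplexity)  # base-2 log of the byte-level perplexity
print(round(bits_per_byte, 4))              # 0.986, matching the row above
```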
|
|
|
### Analysis and Comparisons |
|
|
|
#### Language Modeling Performance |
|
The model achieves a word perplexity of 38.65 on WikiText, which is in the range reported for similarly sized models (exact figures depend on tokenization and evaluation setup). For comparison:
|
- Original GPT-2 (small): ~35-40 perplexity |
|
- GPT-2 medium: ~30-35 perplexity |
|
- BERT-base: ~40-45 perplexity |
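
For a rough hands-on comparison, token-level perplexity on a text of your choice can be estimated as below. This is a minimal sketch with a hypothetical helper (`token_perplexity`), not the evaluation pipeline behind the table; note that it normalizes by token count, whereas the `word_perplexity` above normalizes by word count, so the numbers are not directly comparable.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")
model.eval()

def token_perplexity(text: str) -> float:
    # Passing the input ids as labels makes the model return the mean
    # next-token cross-entropy; exp(loss) is the token-level perplexity.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(token_perplexity("The quick brown fox jumps over the lazy dog."))
```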
|
|
|
#### Task-Specific Analysis
|
|
|
1. Physical and Commonsense Reasoning: |
|
- PIQA: 61.92% (Random baseline: 50%) |
|
- Comparable to GPT-2 small/medium performance |
|
- Shows good physical commonsense understanding |
|
|
|
2. Science Knowledge: |
|
- ARC Easy: 48.02% (Random baseline: 25%) |
|
   - Well above random chance, demonstrating basic scientific knowledge
|
- Similar to performance seen in early GPT-2 variants |
|
|
|
3. Linguistic Understanding: |
|
- LAMBADA: 33.90% accuracy with perplexity of 36.21 |
|
- HellaSwag: 31.26% (Random baseline: 25%) |
|
- Performance indicates basic linguistic and contextual understanding |
|
- Typical range for non-fine-tuned models of this scale |
|
|
|
4. Reasoning and Logic: |
|
- Winogrande: 50.59% (Random baseline: 50%) |
|
   - On par with random chance, suggesting room for improvement in complex reasoning tasks
|
- Common for base models without specific fine-tuning |
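
For context on how the multiple-choice accuracies above are typically obtained: each answer choice is scored by the model's log-likelihood of the choice tokens given the question, and the highest-scoring choice is taken as the prediction; the "normalized" variant divides that score by the byte length of the choice to reduce the bias toward short answers. The sketch below is a simplified, hypothetical scorer (`score_choice`) illustrating the idea, not the exact procedure behind the table:

```python
import torch
import torch.nn.functional as F

def score_choice(model, tokenizer, context: str, choice: str):
    # Log-likelihood of the choice tokens conditioned on the context,
    # plus a byte-length-normalized variant (cf. "accuracy (normalized)").
    ctx_len = tokenizer(context, return_tensors="pt")["input_ids"].shape[1]
    ids = tokenizer(context + choice, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # predicts tokens 1..N-1
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    choice_lp = token_lp[:, ctx_len - 1:].sum().item()  # keep only the choice tokens
    return choice_lp, choice_lp / len(choice.encode("utf-8"))

# The predicted answer is the argmax of the (normalized) score over all choices.
```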
|
|
|
### Strengths and Limitations |
|
|
|
**Strengths:** |
|
- Strong performance on physical commonsense (PIQA) |
|
- Decent basic science knowledge (ARC Easy) |
|
- Competitive language modeling metrics |
|
|
|
**Limitations:** |
|
- Limited complex reasoning capabilities (Winogrande) |
|
- Basic linguistic understanding could be improved (LAMBADA, HellaSwag) |
|
- Performance typical of base models without task-specific fine-tuning |
|
|
|
## Limitations |
|
|
|
This is a base model without fine-tuning or alignment. It should be used with appropriate consideration of its capabilities and limitations. |
|
|