---
datasets:
- DIBT/10k_prompts_ranked
language:
- en
---
# Model Card: Uplimit Project 1 part 1

## Model Description

This is a model used to test-run publishing models to the Hub; it has no real model-assessment value. It is a Large Language Model (LLM) trained on the DIBT/10k_prompts_ranked dataset and evaluated with the EleutherAI Evaluation Harness (lm-evaluation-harness).

## HellaSwag Results
```
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)
```
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.2872|± |0.0045|
| | |none | 0|acc_norm|↑ |0.3082|± |0.0046|
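The results above should be reproducible with a command along the following lines (a sketch assuming a recent lm-evaluation-harness install; flag spellings can vary slightly between versions):

```shell
# Install the EleutherAI evaluation harness
pip install lm-eval

# Evaluate the step-100000 Pythia-160m checkpoint on HellaSwag, zero-shot,
# letting the harness auto-detect the batch size (4 re-detection passes)
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float \
    --tasks hellaswag \
    --num_fewshot 0 \
    --batch_size auto:4
```

Note that the run downloads both the checkpoint and the HellaSwag data, so it requires network access and benefits greatly from a GPU.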
### Interpretation (courtesy of Perplexity.ai)

Accuracy metrics:

- Standard accuracy: 0.2872 (28.72%)
- Normalized accuracy: 0.3082 (30.82%)
Context: The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way. The task is considered difficult even for larger language models.
Interpretation:

- Baseline performance: The model achieves 28.72% accuracy on the standard HellaSwag task, above random guessing (25% for a 4-way multiple-choice task), though only modestly so.
- Normalized performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy, suggesting the model performs marginally better once potential biases in the task are accounted for.
- Model size: Given that Pythia-160m is a relatively small language model (160 million parameters), these results are not unexpected.
- Comparative analysis: While not directly comparable without benchmarks from other models, this performance is likely well below what larger models (e.g., GPT-3, PaLM) achieve on the same task.
- Learning progress: As this is an intermediate checkpoint (step 100000), the model's performance could improve with further training.
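As a quick sanity check on the "above random guessing" point, the gap over the 25% baseline can be compared against the reported standard error (both numbers taken from the results table above):

```python
# Reported HellaSwag results for pythia-160m at revision step100000
acc = 0.2872      # standard accuracy from the table
stderr = 0.0045   # reported standard error
chance = 0.25     # random-guess baseline for a 4-way multiple-choice task

# Number of standard errors by which accuracy exceeds chance
z = (acc - chance) / stderr
print(f"z = {z:.1f}")  # prints z = 8.3
```

So the margin over chance, while small in absolute terms, is roughly eight standard errors wide and is therefore unlikely to be noise.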