---
datasets:
- DIBT/10k_prompts_ranked
language:
- en
---
|
Model Card: Uplimit Project 1 part 1 |
|
Model Description: |
|
This model is a test run of the model-publishing workflow; it has no real assessment value.
|
|
|
This is a large language model (LLM) trained on the DIBT/10k_prompts_ranked dataset.
|
It was evaluated using the EleutherAI Evaluation Harness (lm-evaluation-harness).
|
|
|
HellaSwag
|
```
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)
```
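For reproducibility, the run above can be re-created through the harness's Python API. This is a minimal sketch assuming lm-evaluation-harness v0.4+ (imported as `lm_eval`); the `simple_evaluate` arguments mirror the flags shown in the log:

```python
# Sketch: reproduce the evaluation above via the lm-evaluation-harness
# Python API (assumes lm-evaluation-harness v0.4+ is installed as `lm_eval`).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size="auto:4",  # re-detects the largest batch size 4 times
)
print(results["results"]["hellaswag"])
```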
|
|
|
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.2872|± |0.0045|
| | |none | 0|acc_norm|↑ |0.3082|± |0.0046|
|
|
|
Interpretation (courtesy of Perplexity.ai):
|
|
|
Accuracy Metrics: |
|
Standard accuracy: 0.2872 (28.72%) |
|
Normalized accuracy: 0.3082 (30.82%) |
|
|
|
Context: |
|
The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way. |
|
The task is considered difficult even for larger language models. |
|
|
|
Interpretation |
|
Baseline Performance: The model achieves an accuracy of 28.72% on the standard HellaSwag task, which is well above random guessing (25% for a four-way multiple-choice task).
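As a rough sanity check on that claim, the gap over chance can be expressed in units of the reported standard error. This is a back-of-the-envelope calculation using only the numbers from the table above:

```python
# Back-of-the-envelope check that 28.72% accuracy is above the 25% chance
# baseline, using the stderr reported by the harness (values from the table).
acc, stderr, chance = 0.2872, 0.0045, 0.25

z = (acc - chance) / stderr  # number of standard errors above chance
print(f"z = {z:.1f}")        # ~8.3, so the gap is unlikely to be noise
```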
|
|
|
Normalized Performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy; acc_norm normalizes each candidate ending's log-likelihood by its length, reducing the bias toward shorter completions.
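The difference between the two metrics comes from how the harness scores candidate endings: acc picks the ending with the highest total log-likelihood, while acc_norm normalizes that log-likelihood by the byte length of the ending. The sketch below is a simplified illustration, not the harness's actual code, and the log-likelihood values in it are invented:

```python
# Simplified illustration of acc vs. acc_norm scoring for one multiple-choice
# item. acc picks the ending with the highest total log-likelihood; acc_norm
# picks the highest log-likelihood per byte of the ending.
# The log-likelihood numbers below are invented for illustration.
endings = ["ran.", "ran quickly down the street."]
logliks = [-4.0, -9.0]  # hypothetical total log-likelihoods from the model

acc_choice = max(range(len(endings)), key=lambda i: logliks[i])
acc_norm_choice = max(
    range(len(endings)),
    key=lambda i: logliks[i] / len(endings[i].encode("utf-8")),
)
print(acc_choice, acc_norm_choice)  # 0 1 -- the metrics can disagree
```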
|
|
|
Model Size Consideration: Given that Pythia 160M is a relatively small language model (160 million parameters), these results are not unexpected.
|
|
|
Comparative Analysis: While a direct comparison requires results from other models under the same settings, this performance is likely lower than what larger models (e.g., GPT-3, PaLM) would achieve on the same task.
|
|
|
Learning Progress: As this is an intermediate checkpoint (step 100000), the model's performance could improve with further training.
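For reference, the same intermediate checkpoint can be loaded directly for further training or inspection. This is a sketch assuming the Hugging Face transformers library and the step-numbered revisions EleutherAI publishes for Pythia:

```python
# Sketch: load the same intermediate Pythia checkpoint evaluated above.
# Assumes the transformers library; EleutherAI publishes Pythia training
# checkpoints as git revisions named by step (e.g. "step100000").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step100000",      # intermediate checkpoint used in the evaluation
    torch_dtype=torch.float32,  # matches dtype=float in the harness run
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-160m", revision="step100000"
)
```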