---
datasets:
- DIBT/10k_prompts_ranked
language:
- en
---
Model Card: Uplimit Project 1 part 1
Model Description:
This model exists to test-run the model publishing workflow; it has no real assessment value.

This is a Large Language Model (LLM) trained on the DIBT/10k_prompts_ranked dataset.
It was evaluated with EleutherAI's Evaluation Harness (lm-evaluation-harness).
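
For reference, the training data can be pulled down with the Hugging Face `datasets` library. This is a minimal sketch; the `train` split name is an assumption about the dataset's default layout and should be checked against the dataset card.

```python
from datasets import load_dataset

# Load the prompt-ranking dataset referenced above.
# The "train" split is an assumption about the dataset's default layout.
ds = load_dataset("DIBT/10k_prompts_ranked", split="train")

print(ds)               # schema and number of rows
print(ds.column_names)  # confirm the available fields before indexing into them
```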

HellaSwag
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)

|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc     |↑  |0.2872|±  |0.0045|
|         |       |none  |     0|acc_norm|↑  |0.3082|±  |0.0046|
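
For reproducibility, the run above maps onto the harness's Python entry point roughly as follows. This is a sketch assuming a recent lm-evaluation-harness (v0.4+) where `lm_eval.simple_evaluate` is exposed; the CLI equivalent is the `lm_eval` command with the same arguments.

```python
import lm_eval

# Mirror the logged configuration: Pythia-160m at the step100000 revision,
# float dtype, zero-shot HellaSwag, automatic batch-size detection (auto:4).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float",
    tasks=["hellaswag"],
    batch_size="auto:4",
)

# Per-task metrics (acc / acc_norm and their standard errors) live under
# results["results"]["hellaswag"]; exact key names vary slightly between versions.
print(results["results"]["hellaswag"])
```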

Interpretation (courtesy of Perplexity.ai):

Accuracy Metrics:
Standard accuracy: 0.2872 (28.72%)
Normalized accuracy: 0.3082 (30.82%)

Context:
The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way.
The task is considered difficult even for larger language models.

Interpretation:
Baseline Performance: The model achieves 28.72% accuracy on the standard HellaSwag task, which is above the random-guessing baseline of 25% for a 4-way multiple-choice task (the gap is several times the reported standard error).

Normalized Performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy, suggesting the model does marginally better once each candidate ending's score is normalized for its length (see the toy sketch below).
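
Concretely, for each HellaSwag item the harness scores four candidate endings: `acc` picks the ending with the highest raw log-likelihood, while `acc_norm` divides each log-likelihood by the ending's length (byte length in the harness) before picking, which stops short endings from winning by default. A toy sketch with invented numbers:

```python
# Toy illustration of how acc vs. acc_norm pick an ending (all numbers invented).
endings = [
    {"loglik": -12.0, "num_bytes": 40},
    {"loglik": -9.5,  "num_bytes": 15},   # short ending with the best raw score
    {"loglik": -10.0, "num_bytes": 60},   # longer ending with the best per-byte score
    {"loglik": -14.0, "num_bytes": 30},
]

acc_pick      = max(range(len(endings)), key=lambda i: endings[i]["loglik"])
acc_norm_pick = max(range(len(endings)), key=lambda i: endings[i]["loglik"] / endings[i]["num_bytes"])

print(acc_pick)       # 1 -> raw log-likelihood favours the short ending
print(acc_norm_pick)  # 2 -> length-normalised score favours the longer ending
```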

Model Size Consideration: Given that Pythia-160m is a relatively small language model (160 million parameters), these results are not unexpected.

Comparative Analysis: While not directly comparable without running the same benchmark on other models, this performance is likely lower than what larger models (e.g., GPT-3, PaLM) would achieve on the same task.

Learning Progress: As this is an intermediate checkpoint (step 100000), the model's performance could improve with further training.
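
One way to check that is to re-run the same HellaSwag evaluation against a later published revision of the base model. A sketch with the Hugging Face `transformers` library; the `step143000` revision name is an assumption based on the Pythia suite's checkpoint naming and should be verified on the model page.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a later Pythia-160m checkpoint for comparison with the step100000 results above.
# "step143000" (assumed to be the final published training step) can be swapped
# for any other available revision.
revision = "step143000"
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision=revision)
```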