---
datasets:
- DIBT/10k_prompts_ranked
language:
- en
---
|
Model Card: Uplimit Project 1 part 1 |
|
Model Description: |
|
This model is a test run of the model-publishing workflow; it has no real assessment value.
|
|
|
This is a large language model (LLM) trained on the DIBT/10k_prompts_ranked dataset.
|
It was evaluated using the EleutherAI Evaluation Harness (lm-evaluation-harness).
|
|
|
HellaSwag
|
```
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)
```
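For reproducibility, the run above can be re-created through the harness's Python API. This is a minimal sketch assuming lm-evaluation-harness v0.4+ (imported as `lm_eval`); the `simple_evaluate` arguments mirror the flags shown in the log:

```python
# Sketch: reproduce the evaluation above via the lm-evaluation-harness
# Python API (assumes lm-evaluation-harness v0.4+ is installed as `lm_eval`).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size="auto:4",  # re-detects the largest batch size 4 times
)
print(results["results"]["hellaswag"])
```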
|
|
|
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.2872|± |0.0045|
| | |none | 0|acc_norm|↑ |0.3082|± |0.0046|
|
|
|
Interpretation (courtesy of Perplexity.ai):
|
|
|
Accuracy Metrics: |
|
Standard accuracy: 0.2872 (28.72%) |
|
Normalized accuracy: 0.3082 (30.82%) |
|
|
|
Context: |
|
The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way. |
|
The task is considered difficult even for larger language models. |
|
|
|
Interpretation |
|
Baseline Performance: The model achieves an accuracy of 28.72% on the standard HellaSwag task, which is well above random guessing (25% for a four-way multiple-choice task).
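As a rough sanity check on that claim, the gap over chance can be expressed in units of the reported standard error. This is a back-of-the-envelope calculation using only the numbers from the table above:

```python
# Back-of-the-envelope check that 28.72% accuracy is above the 25% chance
# baseline, using the stderr reported by the harness (values from the table).
acc, stderr, chance = 0.2872, 0.0045, 0.25

z = (acc - chance) / stderr  # number of standard errors above chance
print(f"z = {z:.1f}")        # ~8.3, so the gap is unlikely to be noise
```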
|
|
|
Normalized Performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy; acc_norm normalizes each candidate ending's log-likelihood by its length, reducing the bias toward shorter completions.
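The difference between the two metrics comes from how the harness scores candidate endings: acc picks the ending with the highest total log-likelihood, while acc_norm normalizes that log-likelihood by the byte length of the ending. The sketch below is a simplified illustration, not the harness's actual code, and the log-likelihood values in it are invented:

```python
# Simplified illustration of acc vs. acc_norm scoring for one multiple-choice
# item. acc picks the ending with the highest total log-likelihood; acc_norm
# picks the highest log-likelihood per byte of the ending.
# The log-likelihood numbers below are invented for illustration.
endings = ["ran.", "ran quickly down the street."]
logliks = [-4.0, -9.0]  # hypothetical total log-likelihoods from the model

acc_choice = max(range(len(endings)), key=lambda i: logliks[i])
acc_norm_choice = max(
    range(len(endings)),
    key=lambda i: logliks[i] / len(endings[i].encode("utf-8")),
)
print(acc_choice, acc_norm_choice)  # 0 1 -- the metrics can disagree
```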
|
|
|
Model Size Consideration: Given that Pythia 160M is a relatively small language model (160 million parameters), these results are not unexpected.
|
|
|
Comparative Analysis: While a direct comparison requires results from other models under the same settings, this performance is likely lower than what larger models (e.g., GPT-3, PaLM) would achieve on the same task.
|
|
|
Learning Progress: As this is an intermediate checkpoint (step 100000), the model's performance could improve with further training.
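For reference, the same intermediate checkpoint can be loaded directly for further training or inspection. This is a sketch assuming the Hugging Face transformers library and the step-numbered revisions EleutherAI publishes for Pythia:

```python
# Sketch: load the same intermediate Pythia checkpoint evaluated above.
# Assumes the transformers library; EleutherAI publishes Pythia training
# checkpoints as git revisions named by step (e.g. "step100000").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m",
    revision="step100000",      # intermediate checkpoint used in the evaluation
    torch_dtype=torch.float32,  # matches dtype=float in the harness run
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-160m", revision="step100000"
)
```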