vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.912|± 0.018|
|||strict-match|5|exact_match ↑|0.912|± 0.018|
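
The header lines in this card are the run configurations printed by lm-evaluation-harness. As a minimal, hedged sketch (not necessarily the exact invocation used here), a run like the first GSM8K result above can be reproduced through the harness's Python API with the vLLM backend; the model path, few-shot count, and sample limit simply mirror the header line.

```python
# Hedged sketch: reproduce one GSM8K run via lm-evaluation-harness + vLLM.
# The model path and the 250-sample limit are copied from the header line
# above; everything else follows harness defaults.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/root/autodl-tmp/Cydonia-24B-v2,"
        "add_bos_token=true,max_model_len=2048,dtype=bfloat16"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    limit=250,
    batch_size="auto",
)

# The gsm8k entry reports exact_match under both the flexible-extract and
# strict-match filters, matching the tables in this card.
print(results["results"]["gsm8k"])
```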

vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.904|± 0.0132|
|||strict-match|5|exact_match ↑|0.894|± 0.0138|

vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7942|± 0.0131|
|- humanities|2|none||acc ↑|0.8205|± 0.0257|
|- other|2|none||acc ↑|0.8103|± 0.0271|
|- social sciences|2|none||acc ↑|0.8500|± 0.0257|
|- stem|2|none||acc ↑|0.7298|± 0.0249|

vllm (pretrained=/root/autodl-tmp/80-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.836|± 0.0235|
|||strict-match|5|exact_match ↑|0.828|± 0.0239|

vllm (pretrained=/root/autodl-tmp/80-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.864|± 0.0153|
|||strict-match|5|exact_match ↑|0.840|± 0.0164|

vllm (pretrained=/root/autodl-tmp/80-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7789|± 0.0135|
|- humanities|2|none||acc ↑|0.8000|± 0.0266|
|- other|2|none||acc ↑|0.7846|± 0.0280|
|- social sciences|2|none||acc ↑|0.8444|± 0.0258|
|- stem|2|none||acc ↑|0.7193|± 0.0257|

vllm (pretrained=/root/autodl-tmp/80-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.852|± 0.0225|
|||strict-match|5|exact_match ↑|0.824|± 0.0241|

vllm (pretrained=/root/autodl-tmp/80-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.860|± 0.0155|
|||strict-match|5|exact_match ↑|0.816|± 0.0173|

vllm (pretrained=/root/autodl-tmp/80-512-df10,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7778|± 0.0135|
|- humanities|2|none||acc ↑|0.8103|± 0.0261|
|- other|2|none||acc ↑|0.7846|± 0.0280|
|- social sciences|2|none||acc ↑|0.8500|± 0.0259|
|- stem|2|none||acc ↑|0.7053|± 0.0261|

vllm (pretrained=/root/autodl-tmp/83-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.860|± 0.0220|
|||strict-match|5|exact_match ↑|0.852|± 0.0225|

vllm (pretrained=/root/autodl-tmp/83-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.864|± 0.0153|
|||strict-match|5|exact_match ↑|0.844|± 0.0162|

vllm (pretrained=/root/autodl-tmp/83-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0134|
|- humanities|2|none||acc ↑|0.8051|± 0.0268|
|- other|2|none||acc ↑|0.7795|± 0.0275|
|- social sciences|2|none||acc ↑|0.8556|± 0.0255|
|- stem|2|none||acc ↑|0.6982|± 0.0260|

vllm (pretrained=/root/autodl-tmp/84-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.864|± 0.0217|
|||strict-match|5|exact_match ↑|0.848|± 0.0228|

vllm (pretrained=/root/autodl-tmp/84-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.870|± 0.0151|
|||strict-match|5|exact_match ↑|0.844|± 0.0162|

vllm (pretrained=/root/autodl-tmp/84-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7754|± 0.0134|
|- humanities|2|none||acc ↑|0.8000|± 0.0269|
|- other|2|none||acc ↑|0.7897|± 0.0278|
|- social sciences|2|none||acc ↑|0.8667|± 0.0248|
|- stem|2|none||acc ↑|0.6912|± 0.0259|

vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.856|± 0.0222|
|||strict-match|5|exact_match ↑|0.840|± 0.0232|

vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.872|± 0.0150|
|||strict-match|5|exact_match ↑|0.840|± 0.0164|

vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0134|
|- humanities|2|none||acc ↑|0.8000|± 0.0259|
|- other|2|none||acc ↑|0.7846|± 0.0280|
|- social sciences|2|none||acc ↑|0.8611|± 0.0251|
|- stem|2|none||acc ↑|0.6947|± 0.0262|

vllm (pretrained=/root/autodl-tmp/86-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.868|± 0.0215|
|||strict-match|5|exact_match ↑|0.856|± 0.0222|

vllm (pretrained=/root/autodl-tmp/86-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.88|± 0.0145|
|||strict-match|5|exact_match ↑|0.85|± 0.0160|

vllm (pretrained=/root/autodl-tmp/86-1024,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7731|± 0.0134|
|- humanities|2|none||acc ↑|0.8051|± 0.0262|
|- other|2|none||acc ↑|0.7795|± 0.0275|
|- social sciences|2|none||acc ↑|0.8444|± 0.0261|
|- stem|2|none||acc ↑|0.7018|± 0.0256|

vllm (pretrained=/root/autodl-tmp/865-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0206|
|||strict-match|5|exact_match ↑|0.868|± 0.0215|

vllm (pretrained=/root/autodl-tmp/865-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.862|± 0.0154|
|||strict-match|5|exact_match ↑|0.832|± 0.0167|

vllm (pretrained=/root/autodl-tmp/865-1024,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7719|± 0.0135|
|- humanities|2|none||acc ↑|0.8051|± 0.0265|
|- other|2|none||acc ↑|0.7744|± 0.0277|
|- social sciences|2|none||acc ↑|0.8444|± 0.0263|
|- stem|2|none||acc ↑|0.7018|± 0.0261|

vllm (pretrained=/root/autodl-tmp/8675-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.876|± 0.0209|
|||strict-match|5|exact_match ↑|0.852|± 0.0225|

vllm (pretrained=/root/autodl-tmp/8675-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.878|± 0.0147|
|||strict-match|5|exact_match ↑|0.854|± 0.0158|

vllm (pretrained=/root/autodl-tmp/8675-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7836|± 0.0132|
|- humanities|2|none||acc ↑|0.7949|± 0.0263|
|- other|2|none||acc ↑|0.7949|± 0.0270|
|- social sciences|2|none||acc ↑|0.8556|± 0.0254|
|- stem|2|none||acc ↑|0.7228|± 0.0255|

vllm (pretrained=/root/autodl-tmp/8675-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0206|
|||strict-match|5|exact_match ↑|0.872|± 0.0212|

vllm (pretrained=/root/autodl-tmp/8675-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.886|± 0.0142|
|||strict-match|5|exact_match ↑|0.858|± 0.0156|

vllm (pretrained=/root/autodl-tmp/8675-3048,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7719|± 0.0134|
|- humanities|2|none||acc ↑|0.7846|± 0.0264|
|- other|2|none||acc ↑|0.7846|± 0.0278|
|- social sciences|2|none||acc ↑|0.8556|± 0.0251|
|- stem|2|none||acc ↑|0.7018|± 0.0259|

vllm (pretrained=/root/autodl-tmp/86875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.888|± 0.0200|
|||strict-match|5|exact_match ↑|0.880|± 0.0206|

vllm (pretrained=/root/autodl-tmp/86875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.876|± 0.0148|
|||strict-match|5|exact_match ↑|0.850|± 0.0160|

vllm (pretrained=/root/autodl-tmp/86875-1024,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0134|
|- humanities|2|none||acc ↑|0.8154|± 0.0260|
|- other|2|none||acc ↑|0.7795|± 0.0280|
|- social sciences|2|none||acc ↑|0.8500|± 0.0257|
|- stem|2|none||acc ↑|0.6947|± 0.0258|

vllm (pretrained=/root/autodl-tmp/86875-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.884|± 0.0203|
|||strict-match|5|exact_match ↑|0.876|± 0.0209|

vllm (pretrained=/root/autodl-tmp/86875-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0145|
|||strict-match|5|exact_match ↑|0.856|± 0.0157|

vllm (pretrained=/root/autodl-tmp/86875-3048,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7731|± 0.0134|
|- humanities|2|none||acc ↑|0.8205|± 0.0256|
|- other|2|none||acc ↑|0.7897|± 0.0281|
|- social sciences|2|none||acc ↑|0.8278|± 0.0271|
|- stem|2|none||acc ↑|0.6947|± 0.0252|

vllm (pretrained=/root/autodl-tmp/869-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.888|± 0.0200|
|||strict-match|5|exact_match ↑|0.880|± 0.0206|

vllm (pretrained=/root/autodl-tmp/869-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.876|± 0.0148|
|||strict-match|5|exact_match ↑|0.850|± 0.0160|

vllm (pretrained=/root/autodl-tmp/869-1024,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0134|
|- humanities|2|none||acc ↑|0.8154|± 0.0260|
|- other|2|none||acc ↑|0.7795|± 0.0280|
|- social sciences|2|none||acc ↑|0.8500|± 0.0257|
|- stem|2|none||acc ↑|0.6947|± 0.0258|

vllm (pretrained=/root/autodl-tmp/869-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.836|± 0.0235|
|||strict-match|5|exact_match ↑|0.820|± 0.0243|

vllm (pretrained=/root/autodl-tmp/869-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.848|± 0.0161|
|||strict-match|5|exact_match ↑|0.822|± 0.0171|

vllm (pretrained=/root/autodl-tmp/869-1536,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0134|
|- humanities|2|none||acc ↑|0.7949|± 0.0266|
|- other|2|none||acc ↑|0.7846|± 0.0275|
|- social sciences|2|none||acc ↑|0.8500|± 0.0255|
|- stem|2|none||acc ↑|0.7053|± 0.0258|

vllm (pretrained=/root/autodl-tmp/8695-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.860|± 0.0220|
|||strict-match|5|exact_match ↑|0.852|± 0.0225|

vllm (pretrained=/root/autodl-tmp/8695-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.876|± 0.0148|
|||strict-match|5|exact_match ↑|0.848|± 0.0161|

vllm (pretrained=/root/autodl-tmp/8695-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0134|
|- humanities|2|none||acc ↑|0.8000|± 0.0260|
|- other|2|none||acc ↑|0.7795|± 0.0279|
|- social sciences|2|none||acc ↑|0.8500|± 0.0258|
|- stem|2|none||acc ↑|0.7053|± 0.0256|

vllm (pretrained=/root/autodl-tmp/87-128,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.848|± 0.0228|
|||strict-match|5|exact_match ↑|0.836|± 0.0235|

vllm (pretrained=/root/autodl-tmp/87-128,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.85|± 0.0160|
|||strict-match|5|exact_match ↑|0.83|± 0.0168|

vllm (pretrained=/root/autodl-tmp/87-128,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7719|± 0.0134|
|- humanities|2|none||acc ↑|0.8000|± 0.0263|
|- other|2|none||acc ↑|0.7897|± 0.0271|
|- social sciences|2|none||acc ↑|0.8444|± 0.0261|
|- stem|2|none||acc ↑|0.6947|± 0.0262|

vllm (pretrained=/root/autodl-tmp/87-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.872|± 0.0212|
|||strict-match|5|exact_match ↑|0.856|± 0.0222|

vllm (pretrained=/root/autodl-tmp/87-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.862|± 0.0154|
|||strict-match|5|exact_match ↑|0.840|± 0.0164|

vllm (pretrained=/root/autodl-tmp/87-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7848|± 0.0134|
|- humanities|2|none||acc ↑|0.8154|± 0.0261|
|- other|2|none||acc ↑|0.8000|± 0.0276|
|- social sciences|2|none||acc ↑|0.8444|± 0.0261|
|- stem|2|none||acc ↑|0.7158|± 0.0260|

vllm (pretrained=/root/autodl-tmp/87-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0206|
|||strict-match|5|exact_match ↑|0.868|± 0.0215|

vllm (pretrained=/root/autodl-tmp/87-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0145|
|||strict-match|5|exact_match ↑|0.854|± 0.0158|

vllm (pretrained=/root/autodl-tmp/87-1536,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7836|± 0.0131|
|- humanities|2|none||acc ↑|0.8359|± 0.0249|
|- other|2|none||acc ↑|0.7795|± 0.0279|
|- social sciences|2|none||acc ↑|0.8444|± 0.0259|
|- stem|2|none||acc ↑|0.7123|± 0.0251|

vllm (pretrained=/root/autodl-tmp/87-1536-df01,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.872|± 0.0212|
|||strict-match|5|exact_match ↑|0.864|± 0.0217|

vllm (pretrained=/root/autodl-tmp/87-1536-df01,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.872|± 0.0150|
|||strict-match|5|exact_match ↑|0.848|± 0.0161|

vllm (pretrained=/root/autodl-tmp/87-1536-df01,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7708|± 0.0135|
|- humanities|2|none||acc ↑|0.8000|± 0.0265|
|- other|2|none||acc ↑|0.7744|± 0.0281|
|- social sciences|2|none||acc ↑|0.8444|± 0.0261|
|- stem|2|none||acc ↑|0.7018|± 0.0261|

vllm (pretrained=/root/autodl-tmp/8701-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0206|
|||strict-match|5|exact_match ↑|0.868|± 0.0215|

vllm (pretrained=/root/autodl-tmp/8701-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0145|
|||strict-match|5|exact_match ↑|0.854|± 0.0158|

vllm (pretrained=/root/autodl-tmp/8701-1536,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7836|± 0.0131|
|- humanities|2|none||acc ↑|0.8359|± 0.0249|
|- other|2|none||acc ↑|0.7795|± 0.0279|
|- social sciences|2|none||acc ↑|0.8444|± 0.0259|
|- stem|2|none||acc ↑|0.7123|± 0.0251|

vllm (pretrained=/root/autodl-tmp/87125-512-df2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.860|± 0.0220|
|||strict-match|5|exact_match ↑|0.848|± 0.0228|

vllm (pretrained=/root/autodl-tmp/87125-512-df2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.874|± 0.0149|
|||strict-match|5|exact_match ↑|0.852|± 0.0159|

vllm (pretrained=/root/autodl-tmp/87125-512-df2,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0134|
|- humanities|2|none||acc ↑|0.8051|± 0.0260|
|- other|2|none||acc ↑|0.7846|± 0.0275|
|- social sciences|2|none||acc ↑|0.8444|± 0.0265|
|- stem|2|none||acc ↑|0.7018|± 0.0258|

vllm (pretrained=/root/autodl-tmp/8725-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.868|± 0.0215|
|||strict-match|5|exact_match ↑|0.856|± 0.0222|

vllm (pretrained=/root/autodl-tmp/8725-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.864|± 0.0153|
|||strict-match|5|exact_match ↑|0.848|± 0.0161|

vllm (pretrained=/root/autodl-tmp/8725-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7801|± 0.0133|
|- humanities|2|none||acc ↑|0.8103|± 0.0265|
|- other|2|none||acc ↑|0.7846|± 0.0280|
|- social sciences|2|none||acc ↑|0.8556|± 0.0252|
|- stem|2|none||acc ↑|0.7088|± 0.0254|

vllm (pretrained=/root/autodl-tmp/875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.868|± 0.0215|
|||strict-match|5|exact_match ↑|0.852|± 0.0225|

vllm (pretrained=/root/autodl-tmp/875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.858|± 0.0156|
|||strict-match|5|exact_match ↑|0.834|± 0.0167|

vllm (pretrained=/root/autodl-tmp/875-1024,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7731|± 0.0135|
|- humanities|2|none||acc ↑|0.8000|± 0.0265|
|- other|2|none||acc ↑|0.7795|± 0.0279|
|- social sciences|2|none||acc ↑|0.8389|± 0.0266|
|- stem|2|none||acc ↑|0.7088|± 0.0256|

vllm (pretrained=/root/autodl-tmp/88-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.872|± 0.0212|
|||strict-match|5|exact_match ↑|0.860|± 0.0220|

vllm (pretrained=/root/autodl-tmp/88-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.864|± 0.0153|
|||strict-match|5|exact_match ↑|0.844|± 0.0162|

vllm (pretrained=/root/autodl-tmp/88-512,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7754|± 0.0133|
|- humanities|2|none||acc ↑|0.8103|± 0.0260|
|- other|2|none||acc ↑|0.7897|± 0.0271|
|- social sciences|2|none||acc ↑|0.8333|± 0.0267|
|- stem|2|none||acc ↑|0.7053|± 0.0256|

vllm (pretrained=/root/autodl-tmp/905-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.872|± 0.0212|
|||strict-match|5|exact_match ↑|0.868|± 0.0215|

vllm (pretrained=/root/autodl-tmp/905-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.860|± 0.0155|
|||strict-match|5|exact_match ↑|0.826|± 0.0170|

vllm (pretrained=/root/autodl-tmp/905-1024,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7743|± 0.0136|
|- humanities|2|none||acc ↑|0.8051|± 0.0270|
|- other|2|none||acc ↑|0.7897|± 0.0279|
|- social sciences|2|none||acc ↑|0.8278|± 0.0271|
|- stem|2|none||acc ↑|0.7088|± 0.0256|

The runs below were quantized using neuralmagic/LLM_compression_calibration as the calibration dataset.
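
As a minimal, hedged sketch of what that involves: the calibration set is pulled from the Hugging Face Hub and a subset of its samples is fed to the quantization pass. The split name and the 512-sample count below are assumptions for illustration; the exact recipe used for these runs is not documented in this card.

```python
# Hedged sketch: fetch the calibration data referenced above.
# The split name ("train") and sample count (512) are assumptions,
# not settings documented in this card.
from datasets import load_dataset

calib = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
samples = calib.shuffle(seed=42).select(range(512))
print(samples[0])  # inspect one calibration record
```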

vllm (pretrained=/root/autodl-tmp/80-1024-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.892|± 0.0197|
|||strict-match|5|exact_match ↑|0.872|± 0.0212|

vllm (pretrained=/root/autodl-tmp/80-1024-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.876|± 0.0148|
|||strict-match|5|exact_match ↑|0.850|± 0.0160|

vllm (pretrained=/root/autodl-tmp/80-1024-df10,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7731|± 0.0135|
|- humanities|2|none||acc ↑|0.7846|± 0.0260|
|- other|2|none||acc ↑|0.7795|± 0.0287|
|- social sciences|2|none||acc ↑|0.8333|± 0.0269|
|- stem|2|none||acc ↑|0.7228|± 0.0254|

vllm (pretrained=/root/autodl-tmp/86-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.852|± 0.0225|
|||strict-match|5|exact_match ↑|0.836|± 0.0235|

vllm (pretrained=/root/autodl-tmp/86-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.848|± 0.0161|
|||strict-match|5|exact_match ↑|0.822|± 0.0171|

vllm (pretrained=/root/autodl-tmp/86-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7708|± 0.0134|
|- humanities|2|none||acc ↑|0.8103|± 0.0265|
|- other|2|none||acc ↑|0.7641|± 0.0277|
|- social sciences|2|none||acc ↑|0.8500|± 0.0261|
|- stem|2|none||acc ↑|0.6982|± 0.0257|

vllm (pretrained=/root/autodl-tmp/86-512-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.844|± 0.0230|
|||strict-match|5|exact_match ↑|0.832|± 0.0237|

vllm (pretrained=/root/autodl-tmp/86-512-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.868|± 0.0152|
|||strict-match|5|exact_match ↑|0.846|± 0.0162|

vllm (pretrained=/root/autodl-tmp/86-512-df3,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7778|± 0.0134|
|- humanities|2|none||acc ↑|0.8154|± 0.0256|
|- other|2|none||acc ↑|0.7897|± 0.0278|
|- social sciences|2|none||acc ↑|0.8333|± 0.0271|
|- stem|2|none||acc ↑|0.7088|± 0.0257|

vllm (pretrained=/root/autodl-tmp/86-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.872|± 0.0212|
|||strict-match|5|exact_match ↑|0.856|± 0.0222|

vllm (pretrained=/root/autodl-tmp/86-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.870|± 0.0151|
|||strict-match|5|exact_match ↑|0.846|± 0.0162|

vllm (pretrained=/root/autodl-tmp/86-512-df10,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7801|± 0.0134|
|- humanities|2|none||acc ↑|0.7949|± 0.0272|
|- other|2|none||acc ↑|0.7846|± 0.0280|
|- social sciences|2|none||acc ↑|0.8611|± 0.0248|
|- stem|2|none||acc ↑|0.7158|± 0.0255|

vllm (pretrained=/root/autodl-tmp/86-1024-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.880|± 0.0206|
|||strict-match|5|exact_match ↑|0.872|± 0.0212|

vllm (pretrained=/root/autodl-tmp/86-1024-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.868|± 0.0152|
|||strict-match|5|exact_match ↑|0.840|± 0.0164|

vllm (pretrained=/root/autodl-tmp/86-1024-df3,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

|Groups|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|mmlu|2|none||acc ↑|0.7731|± 0.0135|
|- humanities|2|none||acc ↑|0.8205|± 0.0258|
|- other|2|none||acc ↑|0.7795|± 0.0279|
|- social sciences|2|none||acc ↑|0.8500|± 0.0258|
|- stem|2|none||acc ↑|0.6877|± 0.0263|

Cydonia-24B-v2g-Q8_0.gguf:

vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2g-Q8_0.gguf,add_bos_token=true,max_model_len=2048), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5

|Tasks|Version|Filter|n-shot|Metric|Value|Stderr|
|---|---|---|---|---|---|---|
|gsm8k|3|flexible-extract|5|exact_match ↑|0.908|± 0.0183|
|||strict-match|5|exact_match ↑|0.904|± 0.0187|