vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.912 | ± | 0.018 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.912 | ± | 0.018 |
vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.904 | ± | 0.0132 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.894 | ± | 0.0138 |
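The Stderr column in the gsm8k tables above is consistent with a plain binomial standard error over the number of evaluated samples given by `limit`. A minimal sketch, assuming the harness computes the sample standard deviation (n - 1 denominator) divided by the square root of n for a 0/1 metric; the function name `binomial_stderr` is hypothetical and not part of any library:

```python
import math

def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the n - 1 (sample) variance."""
    # Assumption: stderr = sample_stddev / sqrt(n), which for 0/1 scores
    # simplifies to sqrt(p * (1 - p) / (n - 1)).
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))

# Cydonia-24B-v2 gsm8k values from the tables above.
print(round(binomial_stderr(0.912, 250), 4))  # limit: 250 -> 0.018
print(round(binomial_stderr(0.904, 500), 4))  # limit: 500, flexible-extract -> 0.0132
print(round(binomial_stderr(0.894, 500), 4))  # limit: 500, strict-match -> 0.0138
```

Doubling `limit` from 250 to 500 shrinks the standard error by roughly a factor of the square root of 2, which is why the 500-sample rows report tighter intervals.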
vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7942 | ± | 0.0131 |
- humanities | 2 | none | | acc | ↑ | 0.8205 | ± | 0.0257 |
- other | 2 | none | | acc | ↑ | 0.8103 | ± | 0.0271 |
- social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0257 |
- stem | 2 | none | | acc | ↑ | 0.7298 | ± | 0.0249 |
vllm (pretrained=/root/autodl-tmp/80-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.836 | ± | 0.0235 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.828 | ± | 0.0239 |
vllm (pretrained=/root/autodl-tmp/80-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.864 | ± | 0.0153 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.840 | ± | 0.0164 |
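When comparing two runs, the reported standard errors give a rough sense of whether a gap is larger than sampling noise. A minimal sketch, assuming the two scores can be treated as independent estimates (they are not strictly independent here, since both runs use the same questions, so this is only an approximation); the helper `z_score` is hypothetical:

```python
import math

def z_score(value_a: float, stderr_a: float, value_b: float, stderr_b: float) -> float:
    """Difference between two scores in units of their combined standard error."""
    # Assumption: independent errors, so the variances add.
    combined = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return (value_a - value_b) / combined

# gsm8k flexible-extract at limit 500: Cydonia-24B-v2 vs. 80-512.
z = z_score(0.904, 0.0132, 0.864, 0.0153)
print(f"difference = {0.904 - 0.864:.3f}, z = {z:.2f}")
```

A value of |z| near or below 2 suggests the gap is within what sampling noise alone could plausibly produce at this sample size.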
vllm (pretrained=/root/autodl-tmp/80-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7789 | ± | 0.0135 |
- humanities | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0266 |
- other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0280 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0258 |
- stem | 2 | none | | acc | ↑ | 0.7193 | ± | 0.0257 |
vllm (pretrained=/root/autodl-tmp/80-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.852 | ± | 0.0225 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.824 | ± | 0.0241 |
vllm (pretrained=/root/autodl-tmp/80-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.860 | ± | 0.0155 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.816 | ± | 0.0173 |
vllm (pretrained=/root/autodl-tmp/80-512-df10,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7778 | ± | 0.0135 |
- humanities | 2 | none | | acc | ↑ | 0.8103 | ± | 0.0261 |
- other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0280 |
- social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0259 |
- stem | 2 | none | | acc | ↑ | 0.7053 | ± | 0.0261 |
vllm (pretrained=/root/autodl-tmp/83-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.860 | ± | 0.0220 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.852 | ± | 0.0225 |
vllm (pretrained=/root/autodl-tmp/83-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.864 | ± | 0.0153 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.844 | ± | 0.0162 |
vllm (pretrained=/root/autodl-tmp/83-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8051 | ± | 0.0268 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0275 |
- social sciences | 2 | none | | acc | ↑ | 0.8556 | ± | 0.0255 |
- stem | 2 | none | | acc | ↑ | 0.6982 | ± | 0.0260 |
vllm (pretrained=/root/autodl-tmp/84-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.864 | ± | 0.0217 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.848 | ± | 0.0228 |
vllm (pretrained=/root/autodl-tmp/84-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.870 | ± | 0.0151 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.844 | ± | 0.0162 |
vllm (pretrained=/root/autodl-tmp/84-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7754 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0269 |
- other | 2 | none | | acc | ↑ | 0.7897 | ± | 0.0278 |
- social sciences | 2 | none | | acc | ↑ | 0.8667 | ± | 0.0248 |
- stem | 2 | none | | acc | ↑ | 0.6912 | ± | 0.0259 |
vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.856 | ± | 0.0222 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.840 | ± | 0.0232 |
vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.872 | ± | 0.0150 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.840 | ± | 0.0164 |
vllm (pretrained=/root/autodl-tmp/85-512,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0259 |
- other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0280 |
- social sciences | 2 | none | | acc | ↑ | 0.8611 | ± | 0.0251 |
- stem | 2 | none | | acc | ↑ | 0.6947 | ± | 0.0262 |
vllm (pretrained=/root/autodl-tmp/86-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.868 | ± | 0.0215 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.856 | ± | 0.0222 |
vllm (pretrained=/root/autodl-tmp/86-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.88 | ± | 0.0145 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.85 | ± | 0.0160 |
vllm (pretrained=/root/autodl-tmp/86-1024,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7731 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8051 | ± | 0.0262 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0275 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0261 |
- stem | 2 | none | | acc | ↑ | 0.7018 | ± | 0.0256 |
vllm (pretrained=/root/autodl-tmp/865-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.868 | ± | 0.0215 |
vllm (pretrained=/root/autodl-tmp/865-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.862 | ± | 0.0154 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.832 | ± | 0.0167 |
vllm (pretrained=/root/autodl-tmp/865-1024,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7719 | ± | 0.0135 |
- humanities | 2 | none | | acc | ↑ | 0.8051 | ± | 0.0265 |
- other | 2 | none | | acc | ↑ | 0.7744 | ± | 0.0277 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0263 |
- stem | 2 | none | | acc | ↑ | 0.7018 | ± | 0.0261 |
vllm (pretrained=/root/autodl-tmp/8675-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.876 | ± | 0.0209 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.852 | ± | 0.0225 |
vllm (pretrained=/root/autodl-tmp/8675-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.878 | ± | 0.0147 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.854 | ± | 0.0158 |
vllm (pretrained=/root/autodl-tmp/8675-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7836 | ± | 0.0132 |
- humanities | 2 | none | | acc | ↑ | 0.7949 | ± | 0.0263 |
- other | 2 | none | | acc | ↑ | 0.7949 | ± | 0.0270 |
- social sciences | 2 | none | | acc | ↑ | 0.8556 | ± | 0.0254 |
- stem | 2 | none | | acc | ↑ | 0.7228 | ± | 0.0255 |
vllm (pretrained=/root/autodl-tmp/8675-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |
vllm (pretrained=/root/autodl-tmp/8675-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.886 | ± | 0.0142 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.858 | ± | 0.0156 |
vllm (pretrained=/root/autodl-tmp/8675-3048,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7719 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0264 |
- other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0278 |
- social sciences | 2 | none | | acc | ↑ | 0.8556 | ± | 0.0251 |
- stem | 2 | none | | acc | ↑ | 0.7018 | ± | 0.0259 |
vllm (pretrained=/root/autodl-tmp/86875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.888 | ± | 0.0200 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
vllm (pretrained=/root/autodl-tmp/86875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.876 | ± | 0.0148 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.850 | ± | 0.0160 |
vllm (pretrained=/root/autodl-tmp/86875-1024,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8154 | ± | 0.0260 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0280 |
- social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0257 |
- stem | 2 | none | | acc | ↑ | 0.6947 | ± | 0.0258 |
vllm (pretrained=/root/autodl-tmp/86875-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.884 | ± | 0.0203 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.876 | ± | 0.0209 |
vllm (pretrained=/root/autodl-tmp/86875-3048,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0145 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.856 | ± | 0.0157 |
vllm (pretrained=/root/autodl-tmp/86875-3048,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7731 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8205 | ± | 0.0256 |
- other | 2 | none | | acc | ↑ | 0.7897 | ± | 0.0281 |
- social sciences | 2 | none | | acc | ↑ | 0.8278 | ± | 0.0271 |
- stem | 2 | none | | acc | ↑ | 0.6947 | ± | 0.0252 |
vllm (pretrained=/root/autodl-tmp/869-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.888 | ± | 0.0200 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
vllm (pretrained=/root/autodl-tmp/869-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.876 | ± | 0.0148 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.850 | ± | 0.0160 |
vllm (pretrained=/root/autodl-tmp/869-1024,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8154 | ± | 0.0260 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0280 |
- social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0257 |
- stem | 2 | none | | acc | ↑ | 0.6947 | ± | 0.0258 |
vllm (pretrained=/root/autodl-tmp/869-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.836 | ± | 0.0235 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.820 | ± | 0.0243 |
vllm (pretrained=/root/autodl-tmp/869-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.848 | ± | 0.0161 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.822 | ± | 0.0171 |
vllm (pretrained=/root/autodl-tmp/869-1536,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.7949 | ± | 0.0266 |
- other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0275 |
- social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0255 |
- stem | 2 | none | | acc | ↑ | 0.7053 | ± | 0.0258 |
vllm (pretrained=/root/autodl-tmp/8695-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.860 | ± | 0.0220 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.852 | ± | 0.0225 |
vllm (pretrained=/root/autodl-tmp/8695-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.876 | ± | 0.0148 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.848 | ± | 0.0161 |
vllm (pretrained=/root/autodl-tmp/8695-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0260 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0279 |
- social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0258 |
- stem | 2 | none | | acc | ↑ | 0.7053 | ± | 0.0256 |
vllm (pretrained=/root/autodl-tmp/87-128,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.848 | ± | 0.0228 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.836 | ± | 0.0235 |
vllm (pretrained=/root/autodl-tmp/87-128,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.85 | ± | 0.0160 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.83 | ± | 0.0168 |
vllm (pretrained=/root/autodl-tmp/87-128,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7719 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0263 |
- other | 2 | none | | acc | ↑ | 0.7897 | ± | 0.0271 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0261 |
- stem | 2 | none | | acc | ↑ | 0.6947 | ± | 0.0262 |
vllm (pretrained=/root/autodl-tmp/87-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.856 | ± | 0.0222 |
vllm (pretrained=/root/autodl-tmp/87-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.862 | ± | 0.0154 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.840 | ± | 0.0164 |
vllm (pretrained=/root/autodl-tmp/87-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7848 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8154 | ± | 0.0261 |
- other | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0276 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0261 |
- stem | 2 | none | | acc | ↑ | 0.7158 | ± | 0.0260 |
vllm (pretrained=/root/autodl-tmp/87-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.868 | ± | 0.0215 |
vllm (pretrained=/root/autodl-tmp/87-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0145 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.854 | ± | 0.0158 |
vllm (pretrained=/root/autodl-tmp/87-1536,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7836 | ± | 0.0131 |
- humanities | 2 | none | | acc | ↑ | 0.8359 | ± | 0.0249 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0279 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0259 |
- stem | 2 | none | | acc | ↑ | 0.7123 | ± | 0.0251 |
vllm (pretrained=/root/autodl-tmp/87-1536-df01,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.864 | ± | 0.0217 |
vllm (pretrained=/root/autodl-tmp/87-1536-df01,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.872 | ± | 0.0150 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.848 | ± | 0.0161 |
vllm (pretrained=/root/autodl-tmp/87-1536-df01,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7708 | ± | 0.0135 |
- humanities | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0265 |
- other | 2 | none | | acc | ↑ | 0.7744 | ± | 0.0281 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0261 |
- stem | 2 | none | | acc | ↑ | 0.7018 | ± | 0.0261 |
vllm (pretrained=/root/autodl-tmp/8701-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.868 | ± | 0.0215 |
vllm (pretrained=/root/autodl-tmp/8701-1536,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0145 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.854 | ± | 0.0158 |
vllm (pretrained=/root/autodl-tmp/8701-1536,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7836 | ± | 0.0131 |
- humanities | 2 | none | | acc | ↑ | 0.8359 | ± | 0.0249 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0279 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0259 |
- stem | 2 | none | | acc | ↑ | 0.7123 | ± | 0.0251 |
vllm (pretrained=/root/autodl-tmp/87125-512-df2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.860 | ± | 0.0220 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.848 | ± | 0.0228 |
vllm (pretrained=/root/autodl-tmp/87125-512-df2,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.874 | ± | 0.0149 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.852 | ± | 0.0159 |
vllm (pretrained=/root/autodl-tmp/87125-512-df2,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0134 |
- humanities | 2 | none | | acc | ↑ | 0.8051 | ± | 0.0260 |
- other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0275 |
- social sciences | 2 | none | | acc | ↑ | 0.8444 | ± | 0.0265 |
- stem | 2 | none | | acc | ↑ | 0.7018 | ± | 0.0258 |
vllm (pretrained=/root/autodl-tmp/8725-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.868 | ± | 0.0215 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.856 | ± | 0.0222 |
vllm (pretrained=/root/autodl-tmp/8725-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.864 | ± | 0.0153 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.848 | ± | 0.0161 |
vllm (pretrained=/root/autodl-tmp/8725-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7801 | ± | 0.0133 |
- humanities | 2 | none | | acc | ↑ | 0.8103 | ± | 0.0265 |
- other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0280 |
- social sciences | 2 | none | | acc | ↑ | 0.8556 | ± | 0.0252 |
- stem | 2 | none | | acc | ↑ | 0.7088 | ± | 0.0254 |
vllm (pretrained=/root/autodl-tmp/875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.868 | ± | 0.0215 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.852 | ± | 0.0225 |
vllm (pretrained=/root/autodl-tmp/875-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.858 | ± | 0.0156 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.834 | ± | 0.0167 |
vllm (pretrained=/root/autodl-tmp/875-1024,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7731 | ± | 0.0135 |
- humanities | 2 | none | | acc | ↑ | 0.8000 | ± | 0.0265 |
- other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0279 |
- social sciences | 2 | none | | acc | ↑ | 0.8389 | ± | 0.0266 |
- stem | 2 | none | | acc | ↑ | 0.7088 | ± | 0.0256 |
vllm (pretrained=/root/autodl-tmp/88-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.860 | ± | 0.0220 |
vllm (pretrained=/root/autodl-tmp/88-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.864 | ± | 0.0153 |
gsm8k | 3 | strict-match | 5 | exact_match | ↑ | 0.844 | ± | 0.0162 |
vllm (pretrained=/root/autodl-tmp/88-512,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | ↑ | 0.7754 | ± | 0.0133 |
- humanities | 2 | none | | acc | ↑ | 0.8103 | ± | 0.0260 |
- other | 2 | none | | acc | ↑ | 0.7897 | ± | 0.0271 |
- social sciences | 2 | none | | acc | ↑ | 0.8333 | ± | 0.0267 |
- stem | 2 | none | | acc | ↑ | 0.7053 | ± | 0.0256 |
vllm (pretrained=/root/autodl-tmp/905-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |
| | | strict-match | 5 | exact_match | ↑ | 0.868 | ± | 0.0215 |

vllm (pretrained=/root/autodl-tmp/905-1024,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.860 | ± | 0.0155 |
| | | strict-match | 5 | exact_match | ↑ | 0.826 | ± | 0.0170 |

vllm (pretrained=/root/autodl-tmp/905-1024,add_bos_token=true,max_model_len=700,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7743 | ± | 0.0136 |
| - humanities | 2 | none | | acc | ↑ | 0.8051 | ± | 0.0270 |
| - other | 2 | none | | acc | ↑ | 0.7897 | ± | 0.0279 |
| - social sciences | 2 | none | | acc | ↑ | 0.8278 | ± | 0.0271 |
| - stem | 2 | none | | acc | ↑ | 0.7088 | ± | 0.0256 |

Using the neuralmagic/LLM_compression_calibration dataset:
vllm (pretrained=/root/autodl-tmp/80-1024-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.892 | ± | 0.0197 |
| | | strict-match | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |

vllm (pretrained=/root/autodl-tmp/80-1024-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.876 | ± | 0.0148 |
| | | strict-match | 5 | exact_match | ↑ | 0.850 | ± | 0.0160 |

vllm (pretrained=/root/autodl-tmp/80-1024-df10,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7731 | ± | 0.0135 |
| - humanities | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0260 |
| - other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0287 |
| - social sciences | 2 | none | | acc | ↑ | 0.8333 | ± | 0.0269 |
| - stem | 2 | none | | acc | ↑ | 0.7228 | ± | 0.0254 |

vllm (pretrained=/root/autodl-tmp/86-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.852 | ± | 0.0225 |
| | | strict-match | 5 | exact_match | ↑ | 0.836 | ± | 0.0235 |

vllm (pretrained=/root/autodl-tmp/86-512,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.848 | ± | 0.0161 |
| | | strict-match | 5 | exact_match | ↑ | 0.822 | ± | 0.0171 |

vllm (pretrained=/root/autodl-tmp/86-512,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7708 | ± | 0.0134 |
| - humanities | 2 | none | | acc | ↑ | 0.8103 | ± | 0.0265 |
| - other | 2 | none | | acc | ↑ | 0.7641 | ± | 0.0277 |
| - social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0261 |
| - stem | 2 | none | | acc | ↑ | 0.6982 | ± | 0.0257 |

vllm (pretrained=/root/autodl-tmp/86-512-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.844 | ± | 0.0230 |
| | | strict-match | 5 | exact_match | ↑ | 0.832 | ± | 0.0237 |

vllm (pretrained=/root/autodl-tmp/86-512-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.868 | ± | 0.0152 |
| | | strict-match | 5 | exact_match | ↑ | 0.846 | ± | 0.0162 |

vllm (pretrained=/root/autodl-tmp/86-512-df3,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7778 | ± | 0.0134 |
| - humanities | 2 | none | | acc | ↑ | 0.8154 | ± | 0.0256 |
| - other | 2 | none | | acc | ↑ | 0.7897 | ± | 0.0278 |
| - social sciences | 2 | none | | acc | ↑ | 0.8333 | ± | 0.0271 |
| - stem | 2 | none | | acc | ↑ | 0.7088 | ± | 0.0257 |

vllm (pretrained=/root/autodl-tmp/86-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |
| | | strict-match | 5 | exact_match | ↑ | 0.856 | ± | 0.0222 |

vllm (pretrained=/root/autodl-tmp/86-512-df10,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.870 | ± | 0.0151 |
| | | strict-match | 5 | exact_match | ↑ | 0.846 | ± | 0.0162 |

vllm (pretrained=/root/autodl-tmp/86-512-df10,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7801 | ± | 0.0134 |
| - humanities | 2 | none | | acc | ↑ | 0.7949 | ± | 0.0272 |
| - other | 2 | none | | acc | ↑ | 0.7846 | ± | 0.0280 |
| - social sciences | 2 | none | | acc | ↑ | 0.8611 | ± | 0.0248 |
| - stem | 2 | none | | acc | ↑ | 0.7158 | ± | 0.0255 |

vllm (pretrained=/root/autodl-tmp/86-1024-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.880 | ± | 0.0206 |
| | | strict-match | 5 | exact_match | ↑ | 0.872 | ± | 0.0212 |

vllm (pretrained=/root/autodl-tmp/86-1024-df3,add_bos_token=true,max_model_len=2048,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.868 | ± | 0.0152 |
| | | strict-match | 5 | exact_match | ↑ | 0.840 | ± | 0.0164 |

vllm (pretrained=/root/autodl-tmp/86-1024-df3,add_bos_token=true,max_model_len=800,dtype=bfloat16), gen_kwargs: (None), limit: 15.0, num_fewshot: None, batch_size: 1

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7731 | ± | 0.0135 |
| - humanities | 2 | none | | acc | ↑ | 0.8205 | ± | 0.0258 |
| - other | 2 | none | | acc | ↑ | 0.7795 | ± | 0.0279 |
| - social sciences | 2 | none | | acc | ↑ | 0.8500 | ± | 0.0258 |
| - stem | 2 | none | | acc | ↑ | 0.6877 | ± | 0.0263 |

Cydonia-24B-v2g-Q8_0.gguf:
vllm (pretrained=/root/autodl-tmp/Cydonia-24B-v2g-Q8_0.gguf,add_bos_token=true,max_model_len=2048), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.908 | ± | 0.0183 |
| | | strict-match | 5 | exact_match | ↑ | 0.904 | ± | 0.0187 |

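
To put the gaps between the quantized runs and the unquantized Cydonia-24B-v2 baseline in context, the reported Value and Stderr pairs can be combined into a rough z-score. The sketch below uses the gsm8k flexible-extract rows at limit 500 from this log. It treats the two runs as independent samples, which is an assumption: the runs share the same questions, so the figure is only a rough guide (and likely a conservative one). The helper name `z_score` is hypothetical.

```python
import math


def z_score(a: float, se_a: float, b: float, se_b: float) -> float:
    """Difference a - b divided by the combined standard error.

    Assumption: the two estimates are independent, so their variances add.
    """
    return (a - b) / math.sqrt(se_a ** 2 + se_b ** 2)


# gsm8k, flexible-extract, limit 500, (Value, Stderr) as reported in this log.
baseline = (0.904, 0.0132)  # Cydonia-24B-v2 (bfloat16)
runs = {
    "875-1024": (0.858, 0.0156),
    "88-512": (0.864, 0.0153),
    "905-1024": (0.860, 0.0155),
    "80-1024-df10": (0.876, 0.0148),
    "86-512": (0.848, 0.0161),
    "86-512-df3": (0.868, 0.0152),
    "86-512-df10": (0.870, 0.0151),
    "86-1024-df3": (0.868, 0.0152),
}

for name, (value, stderr) in runs.items():
    gap = baseline[0] - value
    z = z_score(baseline[0], baseline[1], value, stderr)
    print(f"{name:>13}: gap={gap:.3f} z={z:.2f}")
```

A z-score near or below 2 suggests the gap is hard to distinguish from sampling noise at this sample size; a paired per-question comparison would be needed for a firmer conclusion.
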
Model tree for noneUsername/Cydonia-24B-v2-W8A8-Defective: base model TheDrummer/Cydonia-24B-v2