---
language:
  - en
license: llama2
tags:
  - logic
  - planning
model-index:
  - name: strix-rufipes-70b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 71.33
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ibivibiv/strix-rufipes-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 87.86
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ibivibiv/strix-rufipes-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 69.13
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ibivibiv/strix-rufipes-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 56.72
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ibivibiv/strix-rufipes-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 84.77
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ibivibiv/strix-rufipes-70b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 53.83
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ibivibiv/strix-rufipes-70b
          name: Open LLM Leaderboard
---

# Strix Rufipes 70B


## Prompting

Prompt template (Alpaca style); replace `<prompt>`, without the angle brackets, with your instruction:

```
### Instruction:

<prompt>

### Response:
```
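
If you build prompts programmatically, a one-line helper keeps the template consistent. This is a minimal sketch: `format_prompt` is a hypothetical name for illustration, not something shipped with the model, and it mirrors the exact formatting used in the sample code below.

```python
def format_prompt(instruction: str) -> str:
    """Wrap an instruction in the Alpaca-style template this model expects."""
    return f"### Instruction: {instruction}\n### Response:\n"

prompt = format_prompt("Create a plan for developing the game of snake in python using pygame.")
```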

## Sample Code

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Shard the model across available GPUs; `device_map` (not `device_config`)
# is the correct from_pretrained argument.
model = AutoModelForCausalLM.from_pretrained(
    "ibivibiv/strix-rufipes-70b",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ibivibiv/strix-rufipes-70b")

# Alpaca-style prompt; keep the attention mask so generate() behaves correctly,
# and move the inputs to the same device as the model.
inputs = tokenizer(
    "### Instruction: Create a plan for developing the game of snake in python using pygame.\n### Response:\n",
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```
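
Note that a 70B model holds roughly 140 GB of weights in fp16, so the snippet above assumes a multi-GPU node. If memory is tight, 4-bit quantization via bitsandbytes is one option. A minimal sketch, assuming `transformers` and `bitsandbytes` are installed; this is not part of the original card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization cuts weight memory roughly 4x versus fp16,
# at some cost in output quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ibivibiv/strix-rufipes-70b",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ibivibiv/strix-rufipes-70b")
```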

## Model Details

Model Details

  • Trained by: ibivibiv
  • Library: HuggingFace Transformers
  • Model type: strix-rufipes-70b is an auto-regressive language model fine tuned on the Llama 2 transformer architecture.
  • Language(s): English
  • Purpose: Has specific training for logic enforcement. This model is targeted towards planning exercises.

## Benchmark Scores

| Test Name | Accuracy |
|-----------|----------|
| average of all | 0.6910894247381432 |
| arc:challenge | 0.674061433447099 |
| hellaswag | 0.6898028281218881 |
| hendrycksTest-abstract_algebra | 0.36 |
| hendrycksTest-anatomy | 0.6370370370370371 |
| hendrycksTest-astronomy | 0.7960526315789473 |
| hendrycksTest-business_ethics | 0.73 |
| hendrycksTest-clinical_knowledge | 0.7169811320754716 |
| hendrycksTest-college_biology | 0.8125 |
| hendrycksTest-college_chemistry | 0.47 |
| hendrycksTest-college_computer_science | 0.56 |
| hendrycksTest-college_mathematics | 0.36 |
| hendrycksTest-college_medicine | 0.6820809248554913 |
| hendrycksTest-college_physics | 0.43137254901960786 |
| hendrycksTest-computer_security | 0.75 |
| hendrycksTest-conceptual_physics | 0.6851063829787234 |
| hendrycksTest-econometrics | 0.4824561403508772 |
| hendrycksTest-electrical_engineering | 0.5793103448275863 |
| hendrycksTest-elementary_mathematics | 0.41534391534391535 |
| hendrycksTest-formal_logic | 0.48412698412698413 |
| hendrycksTest-global_facts | 0.5 |
| hendrycksTest-high_school_biology | 0.8064516129032258 |
| hendrycksTest-high_school_chemistry | 0.5073891625615764 |
| hendrycksTest-high_school_computer_science | 0.71 |
| hendrycksTest-high_school_european_history | 0.8424242424242424 |
| hendrycksTest-high_school_geography | 0.8787878787878788 |
| hendrycksTest-high_school_government_and_politics | 0.9326424870466321 |
| hendrycksTest-high_school_macroeconomics | 0.717948717948718 |
| hendrycksTest-high_school_mathematics | 0.2962962962962963 |
| hendrycksTest-high_school_microeconomics | 0.7521008403361344 |
| hendrycksTest-high_school_physics | 0.48344370860927155 |
| hendrycksTest-high_school_psychology | 0.8788990825688073 |
| hendrycksTest-high_school_statistics | 0.5277777777777778 |
| hendrycksTest-high_school_us_history | 0.9019607843137255 |
| hendrycksTest-high_school_world_history | 0.8776371308016878 |
| hendrycksTest-human_aging | 0.7802690582959642 |
| hendrycksTest-human_sexuality | 0.8244274809160306 |
| hendrycksTest-international_law | 0.8677685950413223 |
| hendrycksTest-jurisprudence | 0.8148148148148148 |
| hendrycksTest-logical_fallacies | 0.7914110429447853 |
| hendrycksTest-machine_learning | 0.5357142857142857 |
| hendrycksTest-management | 0.8543689320388349 |
| hendrycksTest-marketing | 0.8974358974358975 |
| hendrycksTest-medical_genetics | 0.73 |
| hendrycksTest-miscellaneous | 0.8569604086845466 |
| hendrycksTest-moral_disputes | 0.7687861271676301 |
| hendrycksTest-moral_scenarios | 0.5184357541899441 |
| hendrycksTest-nutrition | 0.7679738562091504 |
| hendrycksTest-philosophy | 0.7620578778135049 |
| hendrycksTest-prehistory | 0.8271604938271605 |
| hendrycksTest-professional_accounting | 0.5390070921985816 |
| hendrycksTest-professional_law | 0.5743155149934811 |
| hendrycksTest-professional_medicine | 0.6911764705882353 |
| hendrycksTest-professional_psychology | 0.7565359477124183 |
| hendrycksTest-public_relations | 0.7272727272727273 |
| hendrycksTest-security_studies | 0.8 |
| hendrycksTest-sociology | 0.8507462686567164 |
| hendrycksTest-us_foreign_policy | 0.89 |
| hendrycksTest-virology | 0.5542168674698795 |
| hendrycksTest-world_religions | 0.8596491228070176 |
| truthfulqa | 0.4712300987333333 |
| winogrande | 0.8476716653512234 |
| gsm8k | 0.5382865807429871 |
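
These per-task names follow EleutherAI's lm-evaluation-harness conventions (the `hendrycksTest-*` rows are MMLU subjects). To spot-check a score yourself, something like the following should work with a recent harness (v0.4+); the exact API is an assumption here, so check the harness docs for your installed version:

```python
# Assumes: pip install lm-eval  (EleutherAI lm-evaluation-harness, v0.4+)
import lm_eval

# 25-shot ARC-Challenge, matching the leaderboard setting above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ibivibiv/strix-rufipes-70b,dtype=auto",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])
```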

## Citations

```bibtex
@misc{open-llm-leaderboard,
  author       = {Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf},
  title        = {Open LLM Leaderboard},
  year         = {2023},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}}
}

@software{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and Phang, Jason and Reynolds, Laria and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628}
}

@misc{clark2018think,
  title         = {Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
  author        = {Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
  year          = {2018},
  eprint        = {1803.05457},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI}
}

@misc{zellers2019hellaswag,
  title         = {HellaSwag: Can a Machine Really Finish Your Sentence?},
  author        = {Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
  year          = {2019},
  eprint        = {1905.07830},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

@misc{hendrycks2021measuring,
  title         = {Measuring Massive Multitask Language Understanding},
  author        = {Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  year          = {2021},
  eprint        = {2009.03300},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CY}
}

@misc{lin2022truthfulqa,
  title         = {TruthfulQA: Measuring How Models Mimic Human Falsehoods},
  author        = {Stephanie Lin and Jacob Hilton and Owain Evans},
  year          = {2022},
  eprint        = {2109.07958},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

@misc{sakaguchi2019winogrande,
  title         = {{WINOGRANDE}: An Adversarial Winograd Schema Challenge at Scale},
  author        = {Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi},
  year          = {2019},
  eprint        = {1907.10641},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

@misc{cobbe2021gsm8k,
  title         = {Training Verifiers to Solve Math Word Problems},
  author        = {Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
  year          = {2021},
  eprint        = {2110.14168},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```

## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=ibivibiv/strix-rufipes-70b).

| Metric | Value |
|--------|-------|
| Avg. | 70.61 |
| AI2 Reasoning Challenge (25-Shot) | 71.33 |
| HellaSwag (10-Shot) | 87.86 |
| MMLU (5-Shot) | 69.13 |
| TruthfulQA (0-shot) | 56.72 |
| Winogrande (5-shot) | 84.77 |
| GSM8k (5-shot) | 53.83 |