metadata
license: llama3.1
datasets:
  - agentlans/crash-course
base_model:
  - agentlans/Llama3.1-SuperDeepFuse
model-index:
  - name: Llama3.1-SuperDeepFuse-CrashCourse12K
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: wis-k/instruction-following-eval
          split: train
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 71.87
            name: averaged accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=agentlans%2FLlama3.1-SuperDeepFuse-CrashCourse12K
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: SaylorTwift/bbh
          split: test
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 31.83
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=agentlans%2FLlama3.1-SuperDeepFuse-CrashCourse12K
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: lighteval/MATH-Hard
          split: test
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 17.67
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=agentlans%2FLlama3.1-SuperDeepFuse-CrashCourse12K
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          split: train
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 8.39
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=agentlans%2FLlama3.1-SuperDeepFuse-CrashCourse12K
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 8.6
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=agentlans%2FLlama3.1-SuperDeepFuse-CrashCourse12K
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 29.24
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=agentlans%2FLlama3.1-SuperDeepFuse-CrashCourse12K
          name: Open LLM Leaderboard

Llama3.1-SuperDeepFuse-CrashCourse12K

Llama3.1-SuperDeepFuse-CrashCourse12K is an 8B-parameter language model based on Llama3.1-SuperDeepFuse and further fine-tuned on agentlans/crash-course.

Model Details

  • Base Model: Llama3.1-SuperDeepFuse (8B parameters)
  • Fine-tuning Dataset: 12,000 samples from agentlans/crash-course (drawn from 10 high-quality instruction datasets)
  • Model Type: Instruction-tuned language model
  • Language(s): Multilingual
  • License: Follows standard Llama 3.1 usage terms
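
Usage

The model loads like any other Llama 3.1 chat checkpoint through the Hugging Face transformers library. The snippet below is a minimal sketch: the prompt, sampling settings, and the bfloat16/device-map choices are illustrative assumptions rather than documented requirements for this checkpoint.

```python
# Minimal sketch of loading and prompting the model with transformers.
# Prompt and generation settings here are illustrative, not recommended defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agentlans/Llama3.1-SuperDeepFuse-CrashCourse12K"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 precision used in fine-tuning
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the key ideas behind LoRA fine-tuning."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```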

Training Procedure

Fine-tuning

  • Method: LoRA (Low-Rank Adaptation)
  • Optimizer: AdamW
  • Learning Rate: 5e-5
  • Batch Size: 2 per device
  • Gradient Accumulation Steps: 8
  • Training Epochs: 1
  • Max Sequence Length: 2048
  • LoRA Configuration:
    • Rank: 8
    • Alpha: 16
    • Dropout: 0.5
    • Target: all layers
  • Quantization: 4-bit (bitsandbytes)
  • Precision: BF16
  • Other Techniques: NEFTune (noise alpha: 5), RS-LoRA (a configuration sketch follows below)
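
For readers who want to see how these settings fit together, the sketch below maps them onto the Hugging Face peft / bitsandbytes / trl stack. The original training framework is not stated here, so the argument names (for example `target_modules="all-linear"` and the `SFTConfig` fields) are assumptions about that tooling rather than a record of the actual run.

```python
# Rough mapping of the listed hyperparameters onto peft + bitsandbytes + trl.
# Illustration only: the actual training stack for this model is unspecified,
# and argument names (e.g. max_seq_length) vary across trl versions.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit quantization via bitsandbytes, computing in BF16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA: rank 8, alpha 16, dropout 0.5, RS-LoRA scaling;
# "Target: all layers" is interpreted here as all linear modules (an assumption)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.5,
    target_modules="all-linear",
    use_rslora=True,
    task_type="CAUSAL_LM",
)

# Optimizer, schedule, and NEFTune settings from the list above
sft_config = SFTConfig(
    output_dir="outputs",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    max_seq_length=2048,
    bf16=True,
    neftune_noise_alpha=5,
    optim="adamw_torch",
)
# These configs would then be passed to trl's SFTTrainer together with the
# base model (loaded with bnb_config) and the crash-course dataset.
```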

Performance and Limitations

This model potentially offers:

  • Enhanced multi-task reasoning
  • Improved performance in mathematics and coding tasks
  • Better instruction-following abilities

However:

  • Performance may be limited compared to larger model variants
  • Can produce misleading or incorrect outputs
  • Outputs should be independently verified for critical applications

Additional Information

Open LLM Leaderboard Evaluation Results

Detailed and summarized results are available on the Open LLM Leaderboard; see the source links in the evaluation metadata above.

Metric                Value (%)
Average                   27.93
IFEval (0-Shot)           71.87
BBH (3-Shot)              31.83
MATH Lvl 5 (4-Shot)       17.67
GPQA (0-shot)              8.39
MuSR (0-shot)              8.60
MMLU-PRO (5-shot)         29.24
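
Scores of this kind are typically produced with EleutherAI's lm-evaluation-harness, which backs the Open LLM Leaderboard. The sketch below shows one way to attempt a local reproduction; the `leaderboard` task group name is an assumption about the harness, and a local run may not match the leaderboard exactly because the leaderboard applies its own normalization when averaging.

```python
# Sketch of a local evaluation with EleutherAI's lm-evaluation-harness.
# The "leaderboard" task group name is an assumption; verify available tasks
# against the harness documentation before running.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=agentlans/Llama3.1-SuperDeepFuse-CrashCourse12K,dtype=bfloat16",
    tasks=["leaderboard"],  # assumed group covering IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics)
```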