Llama-3.2-1B-Instruct-ORPO


Model Details

This model was obtained by fine-tuning the open-source Llama-3.2-1B-Instruct model on the mlabonne/orpo-dpo-mix-40k dataset using Odds Ratio Preference Optimization (ORPO), a preference-alignment method that folds the preference objective directly into supervised fine-tuning rather than requiring a separate reinforcement-learning stage.
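For reference, below is a minimal fine-tuning sketch using TRL's `ORPOTrainer`; the hyperparameters shown are illustrative assumptions, not the exact recipe used to produce this checkpoint.

```python
# Minimal ORPO fine-tuning sketch (TRL); hyperparameters are illustrative
# assumptions, not the exact recipe behind this checkpoint.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Preference data with prompt/chosen/rejected pairs.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

config = ORPOConfig(
    output_dir="llama-3.2-1b-instruct-orpo",
    beta=0.1,                        # weight of the odds-ratio penalty term
    learning_rate=8e-6,
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
)
trainer.train()
```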

Uses

This model is intended for general-purpose instruction-following and chat-style language tasks.
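A minimal inference sketch with `transformers`, assuming the checkpoint ships with the standard Llama 3.2 chat template:

```python
# Minimal usage sketch; assumes the checkpoint includes a chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ramonactruta/ramonactruta-llama-3.2.Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain ORPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```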

Evaluation

We used the EleutherAI LM Evaluation Harness (lm-eval) to evaluate the fine-tuned model. The table below summarizes the results.

For a more granular breakdown of MMLU, see the MMLU section below.
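For reproducibility, a representative harness run via the Python API (assuming lm-eval >= 0.4; batch size and other settings are illustrative, not the exact configuration used):

```python
# Representative lm-eval run; settings are illustrative assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ramonactruta/ramonactruta-llama-3.2.Instruct",
    tasks=["hellaswag", "arc_easy", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["mmlu"])  # aggregate MMLU accuracy
```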

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| hellaswag | 1 | none | 0 | acc ↑ | 0.4507 | ± 0.0050 |
| | | none | 0 | acc_norm ↑ | 0.6077 | ± 0.0049 |
| arc_easy | 1 | none | 0 | acc ↑ | 0.6856 | ± 0.0095 |
| | | none | 0 | acc_norm ↑ | 0.6368 | ± 0.0099 |
| mmlu | 2 | none | | acc ↑ | 0.4597 | ± 0.0041 |
| - humanities | 2 | none | | acc ↑ | 0.4434 | ± 0.0071 |
| - other | 2 | none | | acc ↑ | 0.5163 | ± 0.0088 |
| - social sciences | 2 | none | | acc ↑ | 0.5057 | ± 0.0088 |
| - stem | 2 | none | | acc ↑ | 0.3834 | ± 0.0085 |


MMLU

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc ↑ | 0.4597 | ± 0.0041 |
| - humanities | 2 | none | | acc ↑ | 0.4434 | ± 0.0071 |
| - formal_logic | 1 | none | 0 | acc ↑ | 0.3254 | ± 0.0419 |
| - high_school_european_history | 1 | none | 0 | acc ↑ | 0.6182 | ± 0.0379 |
| - high_school_us_history | 1 | none | 0 | acc ↑ | 0.5784 | ± 0.0347 |
| - high_school_world_history | 1 | none | 0 | acc ↑ | 0.6540 | ± 0.0310 |
| - international_law | 1 | none | 0 | acc ↑ | 0.6033 | ± 0.0447 |
| - jurisprudence | 1 | none | 0 | acc ↑ | 0.5370 | ± 0.0482 |
| - logical_fallacies | 1 | none | 0 | acc ↑ | 0.4479 | ± 0.0391 |
| - moral_disputes | 1 | none | 0 | acc ↑ | 0.4711 | ± 0.0269 |
| - moral_scenarios | 1 | none | 0 | acc ↑ | 0.3408 | ± 0.0159 |
| - philosophy | 1 | none | 0 | acc ↑ | 0.5177 | ± 0.0284 |
| - prehistory | 1 | none | 0 | acc ↑ | 0.5278 | ± 0.0278 |
| - professional_law | 1 | none | 0 | acc ↑ | 0.3683 | ± 0.0123 |
| - world_religions | 1 | none | 0 | acc ↑ | 0.5906 | ± 0.0377 |
| - other | 2 | none | | acc ↑ | 0.5163 | ± 0.0088 |
| - business_ethics | 1 | none | 0 | acc ↑ | 0.4300 | ± 0.0498 |
| - clinical_knowledge | 1 | none | 0 | acc ↑ | 0.4642 | ± 0.0307 |
| - college_medicine | 1 | none | 0 | acc ↑ | 0.3815 | ± 0.0370 |
| - global_facts | 1 | none | 0 | acc ↑ | 0.3200 | ± 0.0469 |
| - human_aging | 1 | none | 0 | acc ↑ | 0.5157 | ± 0.0335 |
| - management | 1 | none | 0 | acc ↑ | 0.5243 | ± 0.0494 |
| - marketing | 1 | none | 0 | acc ↑ | 0.6709 | ± 0.0308 |
| - medical_genetics | 1 | none | 0 | acc ↑ | 0.4800 | ± 0.0502 |
| - miscellaneous | 1 | none | 0 | acc ↑ | 0.6015 | ± 0.0175 |
| - nutrition | 1 | none | 0 | acc ↑ | 0.5686 | ± 0.0284 |
| - professional_accounting | 1 | none | 0 | acc ↑ | 0.3511 | ± 0.0285 |
| - professional_medicine | 1 | none | 0 | acc ↑ | 0.5625 | ± 0.0301 |
| - virology | 1 | none | 0 | acc ↑ | 0.4157 | ± 0.0384 |
| - social sciences | 2 | none | | acc ↑ | 0.5057 | ± 0.0088 |
| - econometrics | 1 | none | 0 | acc ↑ | 0.2456 | ± 0.0405 |
| - high_school_geography | 1 | none | 0 | acc ↑ | 0.5606 | ± 0.0354 |
| - high_school_government_and_politics | 1 | none | 0 | acc ↑ | 0.5389 | ± 0.0360 |
| - high_school_macroeconomics | 1 | none | 0 | acc ↑ | 0.4128 | ± 0.0250 |
| - high_school_microeconomics | 1 | none | 0 | acc ↑ | 0.4454 | ± 0.0323 |
| - high_school_psychology | 1 | none | 0 | acc ↑ | 0.6183 | ± 0.0208 |
| - human_sexuality | 1 | none | 0 | acc ↑ | 0.5420 | ± 0.0437 |
| - professional_psychology | 1 | none | 0 | acc ↑ | 0.4167 | ± 0.0199 |
| - public_relations | 1 | none | 0 | acc ↑ | 0.5000 | ± 0.0479 |
| - security_studies | 1 | none | 0 | acc ↑ | 0.5265 | ± 0.0320 |
| - sociology | 1 | none | 0 | acc ↑ | 0.6468 | ± 0.0338 |
| - us_foreign_policy | 1 | none | 0 | acc ↑ | 0.6900 | ± 0.0465 |
| - stem | 2 | none | | acc ↑ | 0.3834 | ± 0.0085 |
| - abstract_algebra | 1 | none | 0 | acc ↑ | 0.2500 | ± 0.0435 |
| - anatomy | 1 | none | 0 | acc ↑ | 0.4889 | ± 0.0432 |
| - astronomy | 1 | none | 0 | acc ↑ | 0.5329 | ± 0.0406 |
| - college_biology | 1 | none | 0 | acc ↑ | 0.4931 | ± 0.0418 |
| - college_chemistry | 1 | none | 0 | acc ↑ | 0.3800 | ± 0.0488 |
| - college_computer_science | 1 | none | 0 | acc ↑ | 0.3300 | ± 0.0473 |
| - college_mathematics | 1 | none | 0 | acc ↑ | 0.2800 | ± 0.0451 |
| - college_physics | 1 | none | 0 | acc ↑ | 0.2451 | ± 0.0428 |
| - computer_security | 1 | none | 0 | acc ↑ | 0.4800 | ± 0.0502 |
| - conceptual_physics | 1 | none | 0 | acc ↑ | 0.4383 | ± 0.0324 |
| - electrical_engineering | 1 | none | 0 | acc ↑ | 0.5310 | ± 0.0416 |
| - elementary_mathematics | 1 | none | 0 | acc ↑ | 0.2884 | ± 0.0233 |
| - high_school_biology | 1 | none | 0 | acc ↑ | 0.4935 | ± 0.0284 |
| - high_school_chemistry | 1 | none | 0 | acc ↑ | 0.3645 | ± 0.0339 |
| - high_school_computer_science | 1 | none | 0 | acc ↑ | 0.4500 | ± 0.0500 |
| - high_school_mathematics | 1 | none | 0 | acc ↑ | 0.2815 | ± 0.0274 |
| - high_school_physics | 1 | none | 0 | acc ↑ | 0.3113 | ± 0.0378 |
| - high_school_statistics | 1 | none | 0 | acc ↑ | 0.3657 | ± 0.0328 |
| - machine_learning | 1 | none | 0 | acc ↑ | 0.2768 | ± 0.0425 |


Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019); a minimal version of that calculation is sketched after the list below.

  • Hardware Type: MacBook Air M1
  • Hours used: 1
  • Cloud Provider: GCP (A100)
  • Compute Region: us-east1
  • Carbon Emitted: 0.09 kgCO2, 100% of which was directly offset by the cloud provider.
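
A minimal sketch of the underlying estimate; the per-hour figures below are illustrative assumptions chosen to roughly reproduce the 0.09 kgCO2 figure above, not measured values:

```python
# Lacoste et al. (2019)-style estimate:
#   kgCO2e = hours * average power draw (kW) * PUE * grid carbon intensity
# All constants below are illustrative assumptions, not measured values.
hours = 1.0               # training time reported above
power_draw_kw = 0.4       # assumed average draw of the training hardware
pue = 1.1                 # assumed datacenter power usage effectiveness
grid_kgco2_per_kwh = 0.2  # assumed carbon intensity of the us-east1 grid

kg_co2e = hours * power_draw_kw * pue * grid_kgco2_per_kwh
print(f"~{kg_co2e:.2f} kgCO2e")  # ~0.09, consistent with the figure above
```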

