coven_7b_128k_orpo_alpha / bench-coven.txt
raidhon/coven_7b_128k_orpo_alpha
| Tasks |Version| Filter |n-shot| Metric | Value | |Stderr|
|---------------------------------------|-------|----------------|-----:|-----------|------:|---|-----:|
|winogrande | 1|none | 0|acc | 0.7782|± |0.0117|
|truthfulqa |N/A |none | 0|rouge1_max |47.8575|± |0.8139|
| | |none | 0|bleu_max |21.9412|± |0.7280|
| | |none | 0|rouge2_max |32.7726|± |0.9228|
| | |none | 0|rougeL_diff|-1.4310|± |0.7806|
| | |none | 0|acc | 0.4955|± |0.0115|
| | |none | 0|bleu_diff |-0.2883|± |0.6228|
| | |none | 0|rouge2_acc | 0.3807|± |0.0170|
| | |none | 0|rougeL_max |44.1785|± |0.8274|
| | |none | 0|rougeL_acc | 0.4443|± |0.0174|
| | |none | 0|rouge2_diff|-1.5603|± |0.8950|
| | |none | 0|bleu_acc | 0.4321|± |0.0173|
| | |none | 0|rouge1_diff|-0.7276|± |0.7721|
| | |none | 0|rouge1_acc | 0.4774|± |0.0175|
| - truthfulqa_gen | 3|none | 0|bleu_max |21.9412|± |0.7280|
| | |none | 0|bleu_acc | 0.4321|± |0.0173|
| | |none | 0|bleu_diff |-0.2883|± |0.6228|
| | |none | 0|rouge1_max |47.8575|± |0.8139|
| | |none | 0|rouge1_acc | 0.4774|± |0.0175|
| | |none | 0|rouge1_diff|-0.7276|± |0.7721|
| | |none | 0|rouge2_max |32.7726|± |0.9228|
| | |none | 0|rouge2_acc | 0.3807|± |0.0170|
| | |none | 0|rouge2_diff|-1.5603|± |0.8950|
| | |none | 0|rougeL_max |44.1785|± |0.8274|
| | |none | 0|rougeL_acc | 0.4443|± |0.0174|
| | |none | 0|rougeL_diff|-1.4310|± |0.7806|
| - truthfulqa_mc1 | 2|none | 0|acc | 0.4174|± |0.0173|
| - truthfulqa_mc2 | 2|none | 0|acc | 0.5736|± |0.0151|
|piqa | 1|none | 0|acc | 0.8205|± |0.0090|
| | |none | 0|acc_norm | 0.8395|± |0.0086|
|openbookqa | 1|none | 0|acc | 0.3460|± |0.0213|
| | |none | 0|acc_norm | 0.4800|± |0.0224|
|mmlu |N/A |none | 0|acc | 0.6300|± |0.0038|
| - humanities |N/A |none | 0|acc | 0.5779|± |0.0066|
| - formal_logic | 0|none | 0|acc | 0.4127|± |0.0440|
| - high_school_european_history | 0|none | 0|acc | 0.8061|± |0.0309|
| - high_school_us_history | 0|none | 0|acc | 0.8480|± |0.0252|
| - high_school_world_history | 0|none | 0|acc | 0.8523|± |0.0231|
| - international_law | 0|none | 0|acc | 0.7934|± |0.0370|
| - jurisprudence | 0|none | 0|acc | 0.7685|± |0.0408|
| - logical_fallacies | 0|none | 0|acc | 0.7730|± |0.0329|
| - moral_disputes | 0|none | 0|acc | 0.7110|± |0.0244|
| - moral_scenarios | 0|none | 0|acc | 0.2894|± |0.0152|
| - philosophy | 0|none | 0|acc | 0.7106|± |0.0258|
| - prehistory | 0|none | 0|acc | 0.7685|± |0.0235|
| - professional_law | 0|none | 0|acc | 0.4824|± |0.0128|
| - world_religions | 0|none | 0|acc | 0.8129|± |0.0299|
| - other |N/A |none | 0|acc | 0.7090|± |0.0078|
| - business_ethics | 0|none | 0|acc | 0.5900|± |0.0494|
| - clinical_knowledge | 0|none | 0|acc | 0.7245|± |0.0275|
| - college_medicine | 0|none | 0|acc | 0.6532|± |0.0363|
| - global_facts | 0|none | 0|acc | 0.3200|± |0.0469|
| - human_aging | 0|none | 0|acc | 0.7040|± |0.0306|
| - management | 0|none | 0|acc | 0.7864|± |0.0406|
| - marketing | 0|none | 0|acc | 0.8632|± |0.0225|
| - medical_genetics | 0|none | 0|acc | 0.7500|± |0.0435|
| - miscellaneous | 0|none | 0|acc | 0.8212|± |0.0137|
| - nutrition | 0|none | 0|acc | 0.7451|± |0.0250|
| - professional_accounting | 0|none | 0|acc | 0.5000|± |0.0298|
| - professional_medicine | 0|none | 0|acc | 0.7059|± |0.0277|
| - virology | 0|none | 0|acc | 0.5301|± |0.0389|
| - social_sciences |N/A |none | 0|acc | 0.7358|± |0.0077|
| - econometrics | 0|none | 0|acc | 0.4474|± |0.0468|
| - high_school_geography | 0|none | 0|acc | 0.7525|± |0.0307|
| - high_school_government_and_politics| 0|none | 0|acc | 0.9016|± |0.0215|
| - high_school_macroeconomics | 0|none | 0|acc | 0.6564|± |0.0241|
| - high_school_microeconomics | 0|none | 0|acc | 0.6807|± |0.0303|
| - high_school_psychology | 0|none | 0|acc | 0.8404|± |0.0157|
| - human_sexuality | 0|none | 0|acc | 0.7405|± |0.0384|
| - professional_psychology | 0|none | 0|acc | 0.6552|± |0.0192|
| - public_relations | 0|none | 0|acc | 0.6727|± |0.0449|
| - security_studies | 0|none | 0|acc | 0.7673|± |0.0270|
| - sociology | 0|none | 0|acc | 0.8358|± |0.0262|
| - us_foreign_policy | 0|none | 0|acc | 0.8600|± |0.0349|
| - stem |N/A |none | 0|acc | 0.5265|± |0.0085|
| - abstract_algebra | 0|none | 0|acc | 0.3100|± |0.0465|
| - anatomy | 0|none | 0|acc | 0.6000|± |0.0423|
| - astronomy | 0|none | 0|acc | 0.6842|± |0.0378|
| - college_biology | 0|none | 0|acc | 0.7292|± |0.0372|
| - college_chemistry | 0|none | 0|acc | 0.4700|± |0.0502|
| - college_computer_science | 0|none | 0|acc | 0.5600|± |0.0499|
| - college_mathematics | 0|none | 0|acc | 0.3500|± |0.0479|
| - college_physics | 0|none | 0|acc | 0.3529|± |0.0476|
| - computer_security | 0|none | 0|acc | 0.7100|± |0.0456|
| - conceptual_physics | 0|none | 0|acc | 0.5574|± |0.0325|
| - electrical_engineering | 0|none | 0|acc | 0.5793|± |0.0411|
| - elementary_mathematics | 0|none | 0|acc | 0.4101|± |0.0253|
| - high_school_biology | 0|none | 0|acc | 0.7903|± |0.0232|
| - high_school_chemistry | 0|none | 0|acc | 0.4828|± |0.0352|
| - high_school_computer_science | 0|none | 0|acc | 0.6600|± |0.0476|
| - high_school_mathematics | 0|none | 0|acc | 0.3444|± |0.0290|
| - high_school_physics | 0|none | 0|acc | 0.3642|± |0.0393|
| - high_school_statistics | 0|none | 0|acc | 0.5000|± |0.0341|
| - machine_learning | 0|none | 0|acc | 0.5268|± |0.0474|
|hellaswag | 1|none | 0|acc | 0.6537|± |0.0047|
| | |none | 0|acc_norm | 0.8429|± |0.0036|
|gsm8k | 3|strict-match | 5|exact_match| 0.7218|± |0.0123|
| | |flexible-extract| 5|exact_match| 0.7263|± |0.0123|
|boolq | 2|none | 0|acc | 0.8743|± |0.0058|
|arc_easy | 1|none | 0|acc | 0.8502|± |0.0073|
| | |none | 0|acc_norm | 0.8295|± |0.0077|
|arc_challenge | 1|none | 0|acc | 0.5964|± |0.0143|
| | |none | 0|acc_norm | 0.6169|± |0.0142|
| Groups |Version|Filter|n-shot| Metric | Value | |Stderr|
|------------------|-------|------|-----:|-----------|------:|---|-----:|
|truthfulqa |N/A |none | 0|rouge1_max |47.8575|± |0.8139|
| | |none | 0|bleu_max |21.9412|± |0.7280|
| | |none | 0|rouge2_max |32.7726|± |0.9228|
| | |none | 0|rougeL_diff|-1.4310|± |0.7806|
| | |none | 0|acc | 0.4955|± |0.0115|
| | |none | 0|bleu_diff |-0.2883|± |0.6228|
| | |none | 0|rouge2_acc | 0.3807|± |0.0170|
| | |none | 0|rougeL_max |44.1785|± |0.8274|
| | |none | 0|rougeL_acc | 0.4443|± |0.0174|
| | |none | 0|rouge2_diff|-1.5603|± |0.8950|
| | |none | 0|bleu_acc | 0.4321|± |0.0173|
| | |none | 0|rouge1_diff|-0.7276|± |0.7721|
| | |none | 0|rouge1_acc | 0.4774|± |0.0175|
|mmlu |N/A |none | 0|acc | 0.6300|± |0.0038|
| - humanities |N/A |none | 0|acc | 0.5779|± |0.0066|
| - other |N/A |none | 0|acc | 0.7090|± |0.0078|
| - social_sciences|N/A |none | 0|acc | 0.7358|± |0.0077|
| - stem |N/A |none | 0|acc | 0.5265|± |0.0085|