	---
	license: cc-by-nc-4.0
	tags:
	- mlabonne/NeuralMarcoro14-7B
	- dpo
	- 7B
	- winograd
	- mmlu_abstract_algebra
	- mistral
	datasets:
	- hromi/winograd_dpo_basic
	base_model: mlabonne/NeuralMarcoro14-7B
	model-index:
	- name: Turdus
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: AI2 Reasoning Challenge (25-Shot)
	      type: ai2_arc
	      config: ARC-Challenge
	      split: test
	      args:
	        num_few_shot: 25
	    metrics:
	    - type: acc_norm
	      value: 73.38
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=udkai/Turdus
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: HellaSwag (10-Shot)
	      type: hellaswag
	      split: validation
	      args:
	        num_few_shot: 10
	    metrics:
	    - type: acc_norm
	      value: 88.56
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=udkai/Turdus
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU (5-Shot)
	      type: cais/mmlu
	      config: all
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 64.52
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=udkai/Turdus
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: TruthfulQA (0-shot)
	      type: truthful_qa
	      config: multiple_choice
	      split: validation
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: mc2
	      value: 67.11
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=udkai/Turdus
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: Winogrande (5-shot)
	      type: winogrande
	      config: winogrande_xl
	      split: validation
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 86.66
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=udkai/Turdus
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GSM8k (5-shot)
	      type: gsm8k
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 67.7
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=udkai/Turdus
	      name: Open LLM Leaderboard
	---

	![](https://wizzion.com/solarpunk_turdus.webp)

	# udkai_Turdus
	A less contaminated version of [udkai/Garrulus](https://huggingface.co/udkai/Garrulus) and the second model discussed in the paper *Subtle DPO-Contamination with modified Winogrande increases TruthfulQA, Hellaswag & ARC*.

	Unlike Garrulus, which was obtained after two epochs, this model was obtained after a single epoch of direct preference optimization (DPO) of [NeuralMarcoro14-7B](https://huggingface.co/mlabonne/NeuralMarcoro14-7B) on the [hromi/winograd_dpo](https://huggingface.co/datasets/hromi/winograd_dpo) dataset.
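
For readers unfamiliar with the objective, the standard DPO loss for a single preference pair can be sketched as below. This is the generic textbook formulation, not the training script used for Turdus; the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) completion pair.

    All inputs are summed token log-probabilities of a completion given
    the prompt. beta=0.1 is an assumed value, not taken from the model card.
    """
    # How much more (or less) likely each completion is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities: the policy prefers the chosen answer
# more strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(f"{loss:.4f}")
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log(2); training pushes the margin up and the loss down.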

	As you may notice, the dataset consists mostly of specially modified Winogrande prompts.

	But before flagging this model (or recommending that it be flagged), consider the following:

	Subtle DPO contamination with modified Winogrande raises the average accuracy across all five non-Winogrande metrics (that is, including MMLU and GSM8K) by about 0.2 percentage points over the underlying model.

	| Model                       | ARC   | HellaSwag | MMLU  | TruthfulQA | GSM8K | Average |
	| --------------------------- | ----- | --------- | ----- | ---------- | ----- | ------- |
	| mlabonne/NeuralMarcoro14-7B | 71.42 | 87.59     | 64.84 | 65.64      | 70.74 | 72.046  |
	| udkai/Turdus                | 73.38 | 88.56     | 64.52 | 67.11      | 67.70 | 72.254  |
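
The averages in the table can be rechecked with a few lines of Python. This is only an arithmetic sketch over the scores listed above; the variable names are illustrative.

```python
# Scores copied from the comparison table (Winogrande deliberately excluded).
scores = {
    "mlabonne/NeuralMarcoro14-7B": {
        "ARC": 71.42, "HellaSwag": 87.59, "MMLU": 64.84,
        "TruthfulQA": 65.64, "GSM8K": 70.74,
    },
    "udkai/Turdus": {
        "ARC": 73.38, "HellaSwag": 88.56, "MMLU": 64.52,
        "TruthfulQA": 67.11, "GSM8K": 67.70,
    },
}

# Unweighted mean over the five non-Winogrande benchmarks.
averages = {
    model: sum(row.values()) / len(row) for model, row in scores.items()
}

# Difference in percentage points, Turdus minus the base model.
delta = averages["udkai/Turdus"] - averages["mlabonne/NeuralMarcoro14-7B"]

for model, avg in averages.items():
    print(f"{model}: {avg:.3f}")
print(f"difference: {delta:+.3f} percentage points")
```

Note that the gain is in the average only: ARC, HellaSwag and TruthfulQA rise, while MMLU and GSM8K are lower than for the base model.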

	Yes, as strange as it may sound, one can indeed increase ARC from 71.42% to 73.38% with a single epoch of roughly 1,200 repetitive Winograd schemas...

	# BibTeX
	Should this model, or the quasi-methodology which led to it, be of practical or theoretical interest to you, it would be an honor if you referred to it in your work:

	```
	@misc {udk_dot_ai_turdus,
	author = { {UDK dot AI, Daniel Devatman Hromada} },
	title = { Turdus (Revision 923c305) },
	year = 2024,
	url = { https://huggingface.co/udkai/Turdus },
	doi = { 10.57967/hf/1611 },
	publisher = { Hugging Face }
	}
	```
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_udkai__Turdus)

	| Metric                            | Value |
	|-----------------------------------|------:|
	| Avg.                              | 74.66 |
	| AI2 Reasoning Challenge (25-Shot) | 73.38 |
	| HellaSwag (10-Shot)               | 88.56 |
	| MMLU (5-Shot)                     | 64.52 |
	| TruthfulQA (0-shot)               | 67.11 |
	| Winogrande (5-shot)               | 86.66 |
	| GSM8k (5-shot)                    | 67.70 |
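
As a quick sanity check, the leaderboard "Avg." row appears to be the unweighted mean of the six benchmark scores. The sketch below assumes exactly that; the variable names are illustrative.

```python
# Six benchmark scores copied from the leaderboard table above.
leaderboard = {
    "AI2 Reasoning Challenge (25-Shot)": 73.38,
    "HellaSwag (10-Shot)": 88.56,
    "MMLU (5-Shot)": 64.52,
    "TruthfulQA (0-shot)": 67.11,
    "Winogrande (5-shot)": 86.66,
    "GSM8k (5-shot)": 67.70,
}

# Unweighted mean; the exact value is 74.655, reported in the table as 74.66.
leaderboard_avg = sum(leaderboard.values()) / len(leaderboard)
print(f"{leaderboard_avg:.3f}")
```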