
	---
	base_model:
	- AI-MO/NuminaMath-7B-TIR
	- deepseek-ai/DeepSeek-Prover-V1.5-RL
	tags:
	- merge
	- mergekit
	- lazymergekit
	- AI-MO/NuminaMath-7B-TIR
	- deepseek-ai/DeepSeek-Prover-V1.5-RL
	license: apache-2.0
	model-index:
	- name: Mathmate-7B-DELLA
	  results:
	  - task:
	      type: text-generation
	    dataset:
	      name: AGIEval
	      type: AGIEval
	    metrics:
	    - name: AGIEval
	      type: AGIEval
	      value: 21.95
	  - task:
	      type: text-generation
	    dataset:
	      name: GPT4All
	      type: GPT4All
	    metrics:
	    - name: GPT4All
	      type: GPT4All
	      value: 36.5
	  - task:
	      type: text-generation
	    dataset:
	      name: TruthfulQA
	      type: TruthfulQA
	    metrics:
	    - name: TruthfulQA
	      type: TruthfulQA
	      value: 48.08
	  - task:
	      type: text-generation
	    dataset:
	      name: Bigbench
	      type: Bigbench
	    metrics:
	    - name: Bigbench
	      type: Bigbench
	      value: 28.89
	---

	# Mathmate-7B-DELLA

	Mathmate-7B-DELLA is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):
	* [AI-MO/NuminaMath-7B-TIR](https://huggingface.co/AI-MO/NuminaMath-7B-TIR)
	* [deepseek-ai/DeepSeek-Prover-V1.5-RL](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V1.5-RL)

	## 🧩 Configuration

	```yaml
	models:
	  - model: AI-MO/NuminaMath-7B-TIR
	    parameters:
	      density: 0.5
	      weight: 0.3
	  - model: deepseek-ai/DeepSeek-Prover-V1.5-RL
	    parameters:
	      density: 0.5
	      weight: 0.2
	merge_method: della
	base_model: deepseek-ai/deepseek-math-7b-base
	parameters:
	  normalize: true
	dtype: bfloat16
	```
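As a rough illustration of how the two `weight` values relate once `normalize: true` is set, the sketch below rescales them by their sum. This is a simplified reading of the option, not mergekit's own code: it assumes the case where both models' delta parameters survive pruning and agree in sign, and the names `weights` and `relative_weights` are made up for the example.

```python
# Weights copied from the configuration above.
weights = {
    "AI-MO/NuminaMath-7B-TIR": 0.3,
    "deepseek-ai/DeepSeek-Prover-V1.5-RL": 0.2,
}

# Assumption: with `normalize: true`, contributions are rescaled by the sum of
# the weights wherever both models' delta parameters are kept and agree in sign.
total = sum(weights.values())
relative_weights = {name: w / total for name, w in weights.items()}

for name, share in relative_weights.items():
    print(f"{name}: {share:.2f}")
```

Under that assumption the weights only matter relative to each other, so 0.3 and 0.2 would behave like any other pair in a 3:2 ratio.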

	## 💻 Usage

	```python
	# Install dependencies first (in a notebook cell): !pip install -qU transformers accelerate

	from transformers import AutoTokenizer
	import transformers
	import torch

	model = "Haleshot/Mathmate-7B-DELLA"
	messages = [{"role": "user", "content": "What is a large language model?"}]

	tokenizer = AutoTokenizer.from_pretrained(model)
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
	print(outputs[0]["generated_text"])
	```

	## 📊 Evaluation Results

	Evaluation results using LLM AutoEval:

	| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
	|-------|---------|---------|------------|----------|---------|
	| [Mathmate-7B-DELLA](https://huggingface.co/Haleshot/Mathmate-7B-DELLA) | 21.95 | 36.5 | 48.08 | 28.89 | 33.86 |

	### AGIEval
	| Task | Version | Metric | Value | Stderr |
	|------|---------|--------|-------|--------|
	| agieval_aqua_rat | 0 | acc | 21.26 | 2.57 |
	| | | acc_norm | 22.05 | 2.61 |
	| agieval_logiqa_en | 0 | acc | 20.89 | 1.59 |
	| | | acc_norm | 25.65 | 1.71 |
	| agieval_lsat_ar | 0 | acc | 21.74 | 2.73 |
	| | | acc_norm | 19.57 | 2.62 |
	| agieval_lsat_lr | 0 | acc | 13.92 | 1.53 |
	| | | acc_norm | 18.82 | 1.73 |
	| agieval_lsat_rc | 0 | acc | 21.19 | 2.50 |
	| | | acc_norm | 18.96 | 2.39 |
	| agieval_sat_en | 0 | acc | 24.76 | 3.01 |
	| | | acc_norm | 21.36 | 2.86 |
	| agieval_sat_en_without_passage | 0 | acc | 27.18 | 3.11 |
	| | | acc_norm | 23.30 | 2.95 |
	| agieval_sat_math | 0 | acc | 25.45 | 2.94 |
	| | | acc_norm | 25.91 | 2.96 |

	Average: 21.95%

	### GPT4All
	| Task | Version | Metric | Value | Stderr |
	|------|---------|--------|-------|--------|
	| arc_challenge | 0 | acc | 22.61 | 1.22 |
	| | | acc_norm | 25.68 | 1.28 |
	| arc_easy | 0 | acc | 25.25 | 0.89 |
	| | | acc_norm | 25.08 | 0.89 |
	| boolq | 1 | acc | 52.02 | 0.87 |
	| hellaswag | 0 | acc | 25.77 | 0.44 |
	| | | acc_norm | 26.09 | 0.44 |
	| openbookqa | 0 | acc | 18.40 | 1.73 |
	| | | acc_norm | 28.80 | 2.03 |
	| piqa | 0 | acc | 51.31 | 1.17 |
	| | | acc_norm | 50.11 | 1.17 |
	| winogrande | 0 | acc | 47.75 | 1.40 |

	Average: 36.5%

	### TruthfulQA
	| Task | Version | Metric | Value | Stderr |
	|------|---------|--------|-------|--------|
	| truthfulqa_mc | 1 | mc1 | 22.77 | 1.47 |
	| | | mc2 | 48.08 | 1.70 |

	Average: 48.08%

	### Bigbench
	| Task | Version | Metric | Value | Stderr |
	|------|---------|--------|-------|--------|
	| bigbench_causal_judgement | 0 | multiple_choice_grade | 49.47 | 3.64 |
	| bigbench_date_understanding | 0 | multiple_choice_grade | 13.55 | 1.78 |
	| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 30.23 | 2.86 |
	| bigbench_geometric_shapes | 0 | multiple_choice_grade | 10.03 | 1.59 |
	| | | exact_str_match | 0.00 | 0.00 |
	| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 19.40 | 1.77 |
	| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 14.00 | 1.31 |
	| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 36.67 | 2.79 |
	| bigbench_movie_recommendation | 0 | multiple_choice_grade | 23.60 | 1.90 |
	| bigbench_navigate | 0 | multiple_choice_grade | 47.10 | 1.58 |
	| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 13.05 | 0.75 |
	| bigbench_ruin_names | 0 | multiple_choice_grade | 53.79 | 2.36 |
	| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 15.63 | 1.15 |
	| bigbench_snarks | 0 | multiple_choice_grade | 46.96 | 3.72 |
	| bigbench_sports_understanding | 0 | multiple_choice_grade | 49.70 | 1.59 |
	| bigbench_temporal_sequences | 0 | multiple_choice_grade | 25.80 | 1.38 |
	| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 19.76 | 1.13 |
	| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 14.69 | 0.85 |
	| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 36.67 | 2.79 |

	Average: 28.89%

	Average score: 33.86%

	Elapsed time: 03:52:09
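
The reported averages can be checked by hand from the tables above. The sketch below assumes each figure is an unweighted mean: the AGIEval score over the `acc_norm` rows, and the overall score over the four suite scores. The variable names are illustrative and not part of any evaluation tool.

```python
from statistics import mean

# Per-suite scores copied from the summary table above.
suite_scores = {
    "AGIEval": 21.95,
    "GPT4All": 36.5,
    "TruthfulQA": 48.08,
    "Bigbench": 28.89,
}

# acc_norm values from the AGIEval table, in row order.
agieval_acc_norm = [22.05, 25.65, 19.57, 18.82, 18.96, 21.36, 23.30, 25.91]

# Unweighted means (assumed aggregation, see the note above).
agieval_average = mean(agieval_acc_norm)
overall_average = mean(suite_scores.values())

print(f"AGIEval average: {agieval_average:.4f}")
print(f"Overall average: {overall_average:.4f}")
```

Both values land within rounding distance of the reported 21.95% and 33.86%, which is consistent with the assumed aggregation.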