from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


class Tasks(Enum):
    task0 = Task("SAS", "weighted_accuracy", "SAS")
    task1 = Task("SPP", "weighted_accuracy", "SPP")
    task2 = Task("EDP", "weighted_accuracy", "EDP")
    task3 = Task("TSP_D", "weighted_accuracy", "TSP_D")
    task4 = Task("GCP_D", "weighted_accuracy", "GCP_D")
    task5 = Task("KSP", "weighted_accuracy", "KSP")
    task6 = Task("TSP", "weighted_accuracy", "TSP")
    task7 = Task("GCP", "weighted_accuracy", "GCP")
    task8 = Task("MSP", "weighted_accuracy", "MSP")
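

# Illustrative only (these names are ours for exposition, not identifiers used
# elsewhere in this repository): the enum above is typically iterated to derive
# the leaderboard layout, e.g. the display columns and the metric per benchmark.
COLS = [task.value.col_name for task in Tasks]  # ["SAS", "SPP", ..., "MSP"]
METRICS = {task.value.benchmark: task.value.metric for task in Tasks}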


TITLE = """<h1 align="center" id="space-title">NPHardEval leaderboard</h1>"""


INTRODUCTION_TEXT = """
<div align="center">
  <img
    src="https://raw.githubusercontent.com/casmlab/NPHardEval/main/NPHardEval_text_right.jpg"
    style="width: 80%;"
    alt="NPHardEval logo"
  >
</div>

NPHardEval serves as a comprehensive benchmark for assessing the reasoning abilities of large language models (LLMs) through the lens of computational complexity classes.
"""


LLM_BENCHMARKS_TEXT = """
The paramount importance of complex reasoning in large language models (LLMs) is well recognized, especially in their application to intricate decision-making tasks. This underscores the necessity of thoroughly investigating LLMs' reasoning capabilities. To this end, various benchmarks have been developed to evaluate these capabilities. However, existing benchmarks fall short of providing a comprehensive assessment of LLMs' reasoning potential. Additionally, there is a risk of overfitting: because these benchmarks are static and publicly accessible, models can tailor their responses to specific metrics, artificially inflating their performance.

In response, our research introduces NPHardEval, a novel benchmark meticulously designed to comprehensively evaluate LLMs' reasoning abilities. It comprises a diverse array of 900 algorithmic questions spanning the complexity spectrum up to NP-hard. These questions are strategically selected to cover a wide range of complexities, ensuring a thorough evaluation of LLMs' reasoning power. The benchmark not only offers insights into the current state of reasoning in LLMs but also establishes a baseline for comparing their performance across complexity classes.

[Our repository](https://github.com/casmlab/NPHardEval) contains datasets, data generation scripts, and experimental procedures designed to evaluate LLMs on various reasoning tasks. In particular, we use three complexity classes to define task complexity in the benchmark: P (polynomial time), NP-complete (nondeterministic polynomial-time complete), and NP-hard, in increasing order of both intrinsic difficulty and the resources required to solve them. The nine selected problems are:
1) P problems: Sorted Array Search (SAS), Edit Distance Problem (EDP), and Shortest Path Problem (SPP);
2) NP-complete problems: Traveling Salesman Problem Decision Version (TSP-D), Graph Coloring Problem Decision Version (GCP-D), and Knapsack Problem (KSP);
3) NP-hard problems: Traveling Salesman Problem Optimization Version (TSP), Graph Coloring Problem Optimization Version (GCP), and Meeting Scheduling Problem (MSP).

The following Euler diagram shows how these problems relate in terms of computational complexity.

<div align="center">
  <img
    src="https://raw.githubusercontent.com/casmlab/NPHardEval/main/figure/NP-hard.jpg"
    style="width: 50%;"
    alt="Selected problems and the Euler diagram of computational complexity classes"
  >
</div>
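
This grouping can also be written down directly in code. The following sketch is purely illustrative; the `COMPLEXITY_CLASS` name is ours for exposition and does not appear in the repository:

```python
# Illustrative only: the nine benchmark tasks grouped by complexity class.
COMPLEXITY_CLASS = {
    "P": ["SAS", "EDP", "SPP"],
    "NP-complete": ["TSP_D", "GCP_D", "KSP"],
    "NP-hard": ["TSP", "GCP", "MSP"],
}
```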

Our benchmark offers several advantages over existing benchmarks:
- Data construction grounded in the established computational complexity hierarchy
- Automatic checking mechanisms
- Automatic generation of data points
- A complete focus on reasoning, excluding numerical computation

Our study marks a significant contribution to understanding LLMs' current reasoning capabilities and paves the way for future enhancements. Furthermore, NPHardEval features a dynamic update mechanism that refreshes data points monthly. This approach is crucial in reducing the risk of model overfitting, yielding a more accurate and dependable evaluation of LLMs' reasoning skills. The benchmark dataset and the associated code are available in the [NPHardEval GitHub repository](https://github.com/casmlab/NPHardEval).

## Quick Start

### Environment setup
```bash
conda create --name llm_reason python=3.10
conda activate llm_reason
git clone https://github.com/casmlab/NPHardEval.git
cd NPHardEval
pip install -r requirements.txt
```

### Set up API keys
Set up your API keys in `secrets.txt`. **Never upload your keys to any public repository.**

### Example Commands
Let's use the GPT-4 Turbo model (`gpt-4-1106-preview`) and the GCP task as an example.

For the zeroshot experiment, run:
```bash
cd run/run_close_zeroshot
python run_hard_GCP.py gpt-4-1106-preview
```

For the fewshot experiment, run:
```bash
cd run/run_close_fewshot
python run_hard_GCP.py gpt-4-1106-preview self
```
"""


EVALUATION_QUEUE_TEXT = """
Currently, we don't support the submission of new evaluations.
"""


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results."

CITATION_BUTTON_TEXT = r"""
@misc{fan2023nphardeval,
    title={NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes},
    author={Lizhou Fan and Wenyue Hua and Lingyao Li and Haoyang Ling and Yongfeng Zhang and Libby Hemphill},
    year={2023},
    eprint={2312.14890},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
"""


COMMENT_BUTTON_TEXT = """
We relied on the following sources to label each model's parameter count. Contact us if you find any issues.
- https://grabon.com/blog/claude-users-statistics/
- https://medium.com/@seanbetts/peering-inside-gpt-4-understanding-its-mixture-of-experts-moe-architecture-2a42eb8bdcb3
- https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
"""