	# ScienceAgentBench Evaluation with OpenHands

	This folder contains the evaluation harness for [ScienceAgentBench](https://osu-nlp-group.github.io/ScienceAgentBench/) (paper: <https://arxiv.org/abs/2410.05080>).

	## Setup Environment and LLM Configuration

	Please follow the instructions [here](../../README.md#setup) to set up your local development environment and LLM.

	## Setup ScienceAgentBench

	To prevent benchmark data contamination, we only provide the annotation sheet on [Hugging Face](https://huggingface.co/datasets/osunlp/ScienceAgentBench), which includes all inputs needed to run an agent.

	## Run Inference on ScienceAgentBench

	```bash
	./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh [model_config] [git-version] [use_knowledge] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

	# Example
	./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh llm.eval_gpt4o 0.9.3
	```

	where `model_config` is mandatory, and the rest are optional.

	- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your
	LLM settings, as defined in your `config.toml`.
	- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would
	like to evaluate. It could also be a release tag like `0.6.2`.
	- `use_knowledge`, e.g. `true`, specifies whether the agent is allowed to use expert-provided knowledge as additional input. By default, it is set to `false`.
	- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting
	to `CodeActAgent`.
	- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By
	default, the script evaluates the entire ScienceAgentBench dataset. Note:
	in order to use `eval_limit`, you must also set `agent`.
	- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By
	default, it is set to 30.
	- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By
	default, it is set to 1.
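
The usage line above lists the arguments by position, which is presumably why `eval_limit` cannot be given without `agent`. The following is a minimal sketch of that constraint, not part of the harness: `build_command` is a hypothetical helper, and it assumes the script reads its arguments strictly in the order shown, so every argument before the last one supplied must also be supplied.

```python
import shlex

SCRIPT = "./evaluation/benchmarks/scienceagentbench/scripts/run_infer.sh"

# Positional order copied from the usage line; "git_version" stands for [git-version].
PARAM_ORDER = [
    "model_config", "git_version", "use_knowledge", "agent",
    "eval_limit", "max_iter", "num_workers", "dataset", "dataset_split",
]


def build_command(**params):
    """Hypothetical helper: turn named settings into the positional argv list."""
    unknown = sorted(set(params) - set(PARAM_ORDER))
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    if "model_config" not in params:
        raise ValueError("model_config is mandatory")
    # Everything up to the last supplied argument must be present (assumption:
    # the script has no way to skip a position).
    last = max(PARAM_ORDER.index(name) for name in params)
    needed = PARAM_ORDER[: last + 1]
    missing = [name for name in needed if name not in params]
    if missing:
        raise ValueError(f"also set: {missing}")
    return [SCRIPT] + [str(params[name]) for name in needed]


if __name__ == "__main__":
    cmd = build_command(
        model_config="llm.eval_gpt4o",
        git_version="HEAD",
        use_knowledge="false",
        agent="CodeActAgent",
        eval_limit=10,
    )
    # Print a shell-ready command line without running it.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from the repository root, or the printed line can be pasted into a shell.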

	## Evaluate Generated Programs

	### Extract Necessary Information from OpenHands Log

	After inference completes, use the following command to extract the information needed for evaluation from the output log:

	```bash
	python post_proc.py [log_fname]
	```

	- `log_fname`, e.g. `evaluation/.../output.jsonl`, is the automatically saved trajectory log of an OpenHands agent.

	The output will be written to, e.g., `evaluation/.../output.converted.jsonl`.
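
Before handing the converted file to the evaluation step, it can help to check that it exists and parses. The sketch below is an illustration only: `converted_path` and `load_jsonl` are hypothetical helpers, the `.converted.jsonl` naming is inferred from the example path above rather than from `post_proc.py` itself, and no assumption is made about which fields each record contains.

```python
import json
import sys
from pathlib import Path


def converted_path(log_fname):
    """Guess the post-processed name: output.jsonl -> output.converted.jsonl."""
    path = Path(log_fname)
    return path.with_name(path.stem + ".converted" + path.suffix)


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Pass the same log file name that was given to post_proc.py.
    out = converted_path(sys.argv[1])
    print(f"{out}: {len(load_jsonl(out))} records")
```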

	### Run Evaluation

	Please follow the steps [here](https://github.com/OSU-NLP-Group/ScienceAgentBench/tree/main?tab=readme-ov-file#evaluation-of-generated-code) to evaluate the generated programs.