
Multi-SWE-Bench Evaluation with OpenHands

LLM Setup

Please follow the LLM configuration instructions here.
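
For reference, the LLM settings live in a named group in your config.toml, which is referenced later by the models variable in infer.sh. A minimal sketch is shown below; the key names are the usual OpenHands LLM options and the values are placeholders, so adjust them for your provider.

# Minimal sketch: append an LLM config group to your existing config.toml.
# Key names are the common OpenHands LLM options; values are placeholders.
cat >> config.toml <<'EOF'
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "your-api-key"
temperature = 0.0
EOF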

Dataset Preparing

Please download the Multi-SWE-Bench dataset, then convert it with the following script:

python evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py

Docker image download

Please download the Multi-SWE-Bench Docker images from here.

Generate patch

Please edit the variables in the script, then run it:

bash evaluation/benchmarks/multi_swe_bench/infer.sh

Script variable explanation (an example sketch follows the list):

  • models, e.g. llm.eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
  • git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.6.2.
  • agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.
  • eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default, the script evaluates all 500 issues, capped at the size of the dataset.
  • max_iter, e.g. 20, is the maximum number of iterations for the agent to run. By default, it is set to 50.
  • num_workers, e.g. 3, is the number of parallel workers to run the evaluation. By default, it is set to 1.
  • language, the programming language of the dataset you are evaluating.
  • dataset, the absolute path of the dataset JSONL file.
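
As a rough illustration, the values you edit inside infer.sh might look like the sketch below; the variable names are hypothetical and merely mirror the list above, so match them to whatever the script actually defines.

# Hypothetical values mirroring the variable list above; the real variable
# names and layout inside infer.sh may differ, so edit the script accordingly.
MODELS="llm.eval_gpt4_1106_preview"   # LLM config group from config.toml
GIT_VERSION="HEAD"                    # OpenHands commit hash or release tag
AGENT="CodeActAgent"                  # agent used for the benchmark
EVAL_LIMIT=10                         # evaluate only the first 10 instances
MAX_ITER=20                           # maximum agent iterations per instance
NUM_WORKERS=3                         # parallel evaluation workers
LANGUAGE="java"                       # language split of the dataset
DATASET="/abs/path/to/dataset.jsonl"  # absolute path to the dataset JSONL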

The results will be generated in evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl; you can refer to the example.

Running evaluation

First, install multi-swe-bench.

pip install multi-swe-bench

Second, convert the output.jsonl to patch.jsonl with the following script; you can refer to the example.

python evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py
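
To sanity-check the conversion, you can inspect the first converted record; the patch.jsonl path below is an assumption, so adjust it to wherever convert.py actually writes its output.

# Spot-check the first converted record (output path assumed; adjust as needed).
head -n 1 patch.jsonl | python -m json.tool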

Finally, evaluate with multi-swe-bench. For the config file config.json, you can refer to the example or the GitHub repository.

python -m multi_swe_bench.harness.run_evaluation --config config.json
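
The authoritative schema for config.json is defined by the multi-swe-bench harness; the sketch below is non-authoritative, with field names recalled from the Multi-SWE-bench repository and placeholder values, so check it against the linked example before use.

# Rough sketch of a config.json; every field name and value here is an
# assumption to be verified against the linked example or the GitHub repository.
cat > config.json <<'EOF'
{
  "mode": "evaluation",
  "workdir": "./data/workdir",
  "patch_files": ["./patch.jsonl"],
  "dataset_files": ["/abs/path/to/dataset.jsonl"],
  "output_dir": "./data/output",
  "repo_dir": "./data/repos",
  "log_dir": "./data/logs",
  "log_level": "INFO",
  "max_workers": 4
}
EOF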