# Multi-SWE-Bench Evaluation with OpenHands
## LLM Setup
Please follow the instructions here to set up your LLM config.
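For reference, an LLM config group in `config.toml` typically looks like the sketch below; the group name (`eval_gpt4_1106_preview`) and field values are illustrative examples only and should be replaced with your own model and credentials.

```toml
# Illustrative config.toml entry; group name and values are examples, not defaults.
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."
temperature = 0.0
```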
## Dataset Preparation
Please download the Multi-SWE-Bench dataset, then convert it with the following script:
```bash
python evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py
```
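If you fetch the dataset from Hugging Face, one way to download it is with `huggingface-cli`; the dataset ID below is a placeholder, so use the ID from the dataset page linked above.

```bash
# Placeholder dataset ID; replace <dataset-id> with the actual Multi-SWE-Bench dataset ID.
huggingface-cli download <dataset-id> --repo-type dataset --local-dir ./data/multi_swe_bench
```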
## Docker Image Download
Please download the Multi-SWE-Bench Docker images from here.
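As a sketch, pulling the images with `docker pull` might look like the following; the image list file name is a hypothetical placeholder, so use the actual image list from the link above.

```bash
# Hypothetical file name; replace with the real Multi-SWE-Bench image list (one image per line).
while read -r image; do
  docker pull "$image"
done < multi_swe_bench_images.txt
```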
## Generate Patches
Please edit the script as needed (see the example settings after the variable list below) and run it:
```bash
bash evaluation/benchmarks/multi_swe_bench/infer.sh
```
Script variable explanation:
- `models`, e.g. `llm.eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.
- `git-version`, e.g. `HEAD`, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like `0.6.2`.
- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.
- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, the script evaluates all 500 issues; the limit will not exceed the size of the dataset.
- `max_iter`, e.g. `20`, is the maximum number of iterations for the agent to run. By default, it is set to 50.
- `num_workers`, e.g. `3`, is the number of parallel workers to run the evaluation. By default, it is set to 1.
- `language`, the language of the dataset you are evaluating.
- `dataset`, the absolute path to the dataset JSONL file.
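As referenced above, here is a sketch of how these variables might be set inside `infer.sh`; the variable names, ordering, and values are illustrative, and the script itself is the source of truth.

```bash
# Illustrative settings; edit infer.sh itself for the authoritative variable names.
MODELS="llm.eval_gpt4_1106_preview"                 # LLM config group from config.toml
GIT_VERSION="HEAD"                                  # commit hash or release tag, e.g. 0.6.2
AGENT="CodeActAgent"                                # agent used for the benchmark
EVAL_LIMIT=10                                       # number of instances to evaluate
MAX_ITER=30                                         # max agent iterations per instance
NUM_WORKERS=3                                       # parallel workers
LANGUAGE="java"                                     # language split of the dataset
DATASET="/abs/path/to/multi_swe_bench_java.jsonl"   # absolute path to the dataset JSONL
```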
The results will be generated in `evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl`; you can refer to the example.
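Each line of `output.jsonl` is one evaluated instance. As a rough sketch (field names are indicative, not exhaustive; check the example for the exact schema), a record carries the instance ID and the generated git patch:

```json
{"instance_id": "example_org__example_repo-123", "test_result": {"git_patch": "diff --git a/src/Foo.java b/src/Foo.java\n..."}}
```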
## Running Evaluation
First, install multi-swe-bench:

```bash
pip install multi-swe-bench
```
Second, convert `output.jsonl` to `patch.jsonl` with the following script; you can refer to the example:
```bash
python evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py
```
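The Multi-SWE-Bench harness expects one patch record per line. Based on the upstream examples, a `patch.jsonl` line looks roughly like the following; field names should be verified against the example linked above.

```json
{"org": "example-org", "repo": "example-repo", "number": 123, "fix_patch": "diff --git a/src/Foo.java b/src/Foo.java\n..."}
```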
Finally, evaluate with multi-swe-bench. For the config file `config.json`, you can refer to the example or the GitHub repository:
```bash
python -m multi_swe_bench.harness.run_evaluation --config config.json
```
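Below is a minimal sketch of `config.json`, modeled on the example in the Multi-SWE-bench repository; field names, paths, and defaults are assumptions and should be checked against that example.

```json
{
  "mode": "evaluation",
  "workdir": "./data/workdir",
  "patch_files": ["./data/patches/patch.jsonl"],
  "dataset_files": ["./data/dataset/multi_swe_bench_java.jsonl"],
  "force_build": false,
  "output_dir": "./data/output",
  "log_dir": "./data/logs",
  "log_level": "INFO",
  "max_workers": 4
}
```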