
Multi-SWE-Bench Evaluation with OpenHands

LLM Setup

Please follow the LLM configuration instructions here.
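
For reference, the LLM settings live in a named group in your config.toml, which is referenced later by the models variable in infer.sh. A minimal sketch is shown below; the key names are the usual OpenHands LLM options and the values are placeholders, so adjust them for your provider.

# Minimal sketch: append an LLM config group to your existing config.toml.
# Key names are the common OpenHands LLM options; values are placeholders.
cat >> config.toml <<'EOF'
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "your-api-key"
temperature = 0.0
EOF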

Dataset Preparing

Please download the Multi-SWE-Bench dataset, then convert it with the following script:

python evaluation/benchmarks/multi_swe_bench/scripts/data/data_change.py

Docker image download

Please download the Multi-SWE-Bench Docker images from here.

Generate patch

Please edit the variables in the script, then run it:

bash evaluation/benchmarks/multi_swe_bench/infer.sh

Script variable explanation (an example sketch follows the list):

  • models, e.g. llm.eval_gpt4_1106_preview, is the config group name for your LLM settings, as defined in your config.toml.
  • git-version, e.g. HEAD, is the git commit hash of the OpenHands version you would like to evaluate. It could also be a release tag like 0.6.2.
  • agent, e.g. CodeActAgent, is the name of the agent for benchmarks, defaulting to CodeActAgent.
  • eval_limit, e.g. 10, limits the evaluation to the first eval_limit instances. By default, the script evaluates all 500 issues, capped at the size of the dataset.
  • max_iter, e.g. 20, is the maximum number of iterations for the agent to run. By default, it is set to 50.
  • num_workers, e.g. 3, is the number of parallel workers to run the evaluation. By default, it is set to 1.
  • language, the programming language of the dataset you are evaluating.
  • dataset, the absolute path of the dataset JSONL file.
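
As a rough illustration, the values you edit inside infer.sh might look like the sketch below; the variable names are hypothetical and merely mirror the list above, so match them to whatever the script actually defines.

# Hypothetical values mirroring the variable list above; the real variable
# names and layout inside infer.sh may differ, so edit the script accordingly.
MODELS="llm.eval_gpt4_1106_preview"   # LLM config group from config.toml
GIT_VERSION="HEAD"                    # OpenHands commit hash or release tag
AGENT="CodeActAgent"                  # agent used for the benchmark
EVAL_LIMIT=10                         # evaluate only the first 10 instances
MAX_ITER=20                           # maximum agent iterations per instance
NUM_WORKERS=3                         # parallel evaluation workers
LANGUAGE="java"                       # language split of the dataset
DATASET="/abs/path/to/dataset.jsonl"  # absolute path to the dataset JSONL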

The results will be generated in evaluation/evaluation_outputs/outputs/XXX/CodeActAgent/YYY/output.jsonl; you can refer to the example.

Running evaluation

First, install multi-swe-bench.

pip install multi-swe-bench

Second, convert the output.jsonl to patch.jsonl with the following script; you can refer to the example.

python evaluation/benchmarks/multi_swe_bench/scripts/eval/convert.py
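
To sanity-check the conversion, you can inspect the first converted record; the patch.jsonl path below is an assumption, so adjust it to wherever convert.py actually writes its output.

# Spot-check the first converted record (output path assumed; adjust as needed).
head -n 1 patch.jsonl | python -m json.tool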

Finally, evaluate with multi-swe-bench. For the config file config.json, you can refer to the example or the GitHub repository.

python -m multi_swe_bench.harness.run_evaluation --config config.json
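
The authoritative schema for config.json is defined by the multi-swe-bench harness; the sketch below is non-authoritative, with field names recalled from the Multi-SWE-bench repository and placeholder values, so check it against the linked example before use.

# Rough sketch of a config.json; every field name and value here is an
# assumption to be verified against the linked example or the GitHub repository.
cat > config.json <<'EOF'
{
  "mode": "evaluation",
  "workdir": "./data/workdir",
  "patch_files": ["./patch.jsonl"],
  "dataset_files": ["/abs/path/to/dataset.jsonl"],
  "output_dir": "./data/output",
  "repo_dir": "./data/repos",
  "log_dir": "./data/logs",
  "log_level": "INFO",
  "max_workers": 4
}
EOF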