# SWE-Interact Benchmark
This document explains how to use the Interactive SWE-Bench benchmark scripts for running and evaluating interactive software engineering tasks.
## Setting things up
After following the README to set up the environment, you also need to add an LLM configuration for the simulated human user. In the original paper, we use gpt-4o as the simulated human user. You can add the following to your `config.toml` file:
```toml
[llm.fake_user]
model = "litellm_proxy/gpt-4o-2024-08-06"
api_key = "<your-api-key>"
temperature = 0.0
base_url = "https://llm-proxy.eval.all-hands.dev"
```
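If you prefer to do this from the shell, appending the section with a heredoc works as well; this is just a convenience sketch, and the API key is a placeholder you must replace:

```bash
# Append the simulated-user LLM section to an existing config.toml
# (the API key below is a placeholder)
cat >> config.toml <<'EOF'

[llm.fake_user]
model = "litellm_proxy/gpt-4o-2024-08-06"
api_key = "<your-api-key>"
temperature = 0.0
base_url = "https://llm-proxy.eval.all-hands.dev"
EOF
```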
## Running the Benchmark
The main script for running the benchmark is `run_infer_interact.sh`. Here's how to use it:
```bash
bash ./evaluation/benchmarks/swe_bench/scripts/run_infer_interact.sh <model_config> <commit_hash> <agent> <eval_limit> <max_iter> <num_workers> <split>
```
Parameters:

- `model_config`: The LLM configuration to use, as defined in `config.toml` (e.g., `llm.claude-3-7-sonnet`)
- `commit_hash`: Git commit hash to use (e.g., `HEAD`)
- `agent`: The agent class to use (e.g., `CodeActAgent`)
- `eval_limit`: Number of examples to evaluate (e.g., `500`)
- `max_iter`: Maximum number of iterations per task (e.g., `100`)
- `num_workers`: Number of parallel workers (e.g., `1`)
- `split`: Dataset split to use (e.g., `test`)
Example:

```bash
bash ./evaluation/benchmarks/swe_bench/scripts/run_infer_interact.sh llm.claude-3-7-sonnet HEAD CodeActAgent 500 100 1 test
```
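Before launching a full run, it can be worth doing a quick smoke test on a single instance with a lower iteration cap to confirm the setup; the argument values below are purely illustrative:

```bash
# Smoke test: 1 instance, at most 30 iterations, 1 worker
bash ./evaluation/benchmarks/swe_bench/scripts/run_infer_interact.sh llm.claude-3-7-sonnet HEAD CodeActAgent 1 30 1 test
```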
### Additional Environment Variables

You can customize the behavior using these environment variables (see the example after this list):

- `RUN_WITH_BROWSING`: Enable/disable web browsing (default: `false`)
- `USE_HINT_TEXT`: Enable/disable hint text (default: `false`)
- `EVAL_CONDENSER`: Specify a condenser configuration
- `EXP_NAME`: Add a custom experiment name to the output
- `N_RUNS`: Number of runs to perform (default: `1`)
- `SKIP_RUNS`: Comma-separated list of run numbers to skip
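For example, to name the experiment and perform three runs with the same arguments as the example above (the variable values here are illustrative):

```bash
# Run three times under a custom experiment name, with hints disabled
EXP_NAME=my-experiment N_RUNS=3 USE_HINT_TEXT=false \
  bash ./evaluation/benchmarks/swe_bench/scripts/run_infer_interact.sh \
  llm.claude-3-7-sonnet HEAD CodeActAgent 500 100 1 test
```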
## Evaluating Results
After running the benchmark, you can evaluate the results using `eval_infer.sh`:
```bash
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh <output_file> <instance_id> <dataset> <split>
```
Parameters:

- `output_file`: Path to the output JSONL file
- `instance_id`: The specific instance ID to evaluate
- `dataset`: Dataset name (e.g., `cmu-lti/interactive-swe`)
- `split`: Dataset split (e.g., `test`)
Example:

```bash
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/cmu-lti__interactive-swe-test/CodeActAgent/claude-3-7-sonnet-20250219_maxiter_100_N_v0.39.0-no-hint-run_1/output.jsonl sphinx-doc__sphinx-8721 cmu-lti/interactive-swe test
```
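To evaluate every instance in an output file rather than a single one, a loop along these lines should work, assuming each line of `output.jsonl` contains an `instance_id` field and that `jq` is installed (the output path is just the example from above):

```bash
OUTPUT=evaluation/evaluation_outputs/outputs/cmu-lti__interactive-swe-test/CodeActAgent/claude-3-7-sonnet-20250219_maxiter_100_N_v0.39.0-no-hint-run_1/output.jsonl

# Evaluate each instance_id found in the output file, one at a time
jq -r '.instance_id' "$OUTPUT" | while read -r id; do
  ./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh "$OUTPUT" "$id" cmu-lti/interactive-swe test
done
```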
## Output Structure
The benchmark outputs are stored in the `evaluation/evaluation_outputs/outputs/` directory with the following structure:
```
evaluation/evaluation_outputs/outputs/
└── cmu-lti__interactive-swe-{split}/
    └── {agent}/
        └── {model}-{date}_maxiter_{max_iter}_N_{version}-{options}-run_{run_number}/
            └── output.jsonl
```
Where:

- `{split}` is the dataset split (e.g., `test`)
- `{agent}` is the agent class name
- `{model}` is the model name
- `{date}` is the run date
- `{max_iter}` is the maximum number of iterations
- `{version}` is the OpenHands version
- `{options}` includes any additional options (e.g., `no-hint`, `with-browsing`)
- `{run_number}` is the run number
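To locate the generated output files and get a rough count of completed instances (assuming one JSON line per instance), something like the following can help; the paths are illustrative:

```bash
# List all output.jsonl files produced so far
find evaluation/evaluation_outputs/outputs -name output.jsonl

# Approximate number of completed instances per run directory
wc -l evaluation/evaluation_outputs/outputs/cmu-lti__interactive-swe-test/CodeActAgent/*/output.jsonl
```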