# SWE-Interact Benchmark
This document explains how to use the [Interactive SWE-Bench](https://arxiv.org/abs/2502.13069) benchmark scripts for running and evaluating interactive software engineering tasks.
## Setting things up
After following the [README](./README.md) to set up the environment, you also need to add an LLM configuration for the simulated human user. In the original [paper](https://arxiv.org/abs/2502.13069), we use gpt-4o as the simulated human user. Add the following to your `config.toml` file:
```toml
[llm.fake_user]
model = "litellm_proxy/gpt-4o-2024-08-06"
api_key = "<your-api-key>"
temperature = 0.0
base_url = "https://llm-proxy.eval.all-hands.dev"
```
## Running the Benchmark
The main script for running the benchmark is `run_infer_interact.sh`. Here's how to use it:
```bash
bash ./evaluation/benchmarks/swe_bench/scripts/run_infer_interact.sh <model_config> <commit_hash> <agent> <eval_limit> <max_iter> <num_workers> <split>
```
### Parameters:
- `model_config`: The LLM configuration to use, as defined in your `config.toml` (e.g., `llm.claude-3-7-sonnet`)
- `commit_hash`: Git commit hash to use (e.g., `HEAD`)
- `agent`: The agent class to use (e.g., `CodeActAgent`)
- `eval_limit`: Number of examples to evaluate (e.g., `500`)
- `max_iter`: Maximum number of iterations per task (e.g., `100`)
- `num_workers`: Number of parallel workers (e.g., `1`)
- `split`: Dataset split to use (e.g., `test`)
### Example:
```bash
bash ./evaluation/benchmarks/swe_bench/scripts/run_infer_interact.sh llm.claude-3-7-sonnet HEAD CodeActAgent 500 100 1 test
```
### Additional Environment Variables:
You can customize the behavior using these environment variables (see the example after this list):
- `RUN_WITH_BROWSING`: Enable/disable web browsing (default: `false`)
- `USE_HINT_TEXT`: Enable/disable hint text (default: `false`)
- `EVAL_CONDENSER`: Specify a condenser configuration
- `EXP_NAME`: Add a custom experiment name to the output
- `N_RUNS`: Number of runs to perform (default: `1`)
- `SKIP_RUNS`: Comma-separated list of run numbers to skip
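For instance, to enable browsing and tag the output with a custom experiment name, you can set the variables inline before the script. This is a minimal sketch; the values shown are illustrative:
```bash
# Illustrative values; all variable names come from the list above.
RUN_WITH_BROWSING=true \
EXP_NAME=with-browsing \
N_RUNS=1 \
bash ./evaluation/benchmarks/swe_bench/scripts/run_infer_interact.sh llm.claude-3-7-sonnet HEAD CodeActAgent 500 100 1 test
```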
## Evaluating Results
After running the benchmark, you can evaluate the results using `eval_infer.sh`:
```bash
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh <output_file> <instance_id> <dataset> <split>
```
### Parameters:
- `output_file`: Path to the output JSONL file
- `instance_id`: The specific instance ID to evaluate
- `dataset`: Dataset name (e.g., `cmu-lti/interactive-swe`)
- `split`: Dataset split (e.g., `test`)
### Example:
```bash
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/cmu-lti__interactive-swe-test/CodeActAgent/claude-3-7-sonnet-20250219_maxiter_100_N_v0.39.0-no-hint-run_1/output.jsonl sphinx-doc__sphinx-8721 cmu-lti/interactive-swe test
```
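If you want to score several instances from the same output file, one option is to loop over their IDs. This is a minimal sketch; the output path comes from the example above and the instance list should be replaced with the instances you actually ran:
```bash
# Output file from the example run above; adjust to your own run directory.
OUTPUT=evaluation/evaluation_outputs/outputs/cmu-lti__interactive-swe-test/CodeActAgent/claude-3-7-sonnet-20250219_maxiter_100_N_v0.39.0-no-hint-run_1/output.jsonl

for instance_id in sphinx-doc__sphinx-8721; do  # list the instance IDs you want to score
  ./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh "$OUTPUT" "$instance_id" cmu-lti/interactive-swe test
done
```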
## Output Structure
The benchmark outputs are stored in the `evaluation/evaluation_outputs/outputs/` directory with the following structure:
```
evaluation/evaluation_outputs/outputs/
└── cmu-lti__interactive-swe-{split}/
    └── {agent}/
        └── {model}-{date}_maxiter_{max_iter}_N_{version}-{options}-run_{run_number}/
            └── output.jsonl
```
Where:
- `{split}` is the dataset split (e.g., `test`)
- `{agent}` is the agent class name
- `{model}` is the model name
- `{date}` is the run date
- `{max_iter}` is the maximum number of iterations
- `{version}` is the OpenHands version
- `{options}` includes any additional options (e.g., `no-hint`, `with-browsing`)
- `{run_number}` is the run number
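Each line of `output.jsonl` is a JSON record for one instance. As a quick sanity check, you can count the records and list the instance IDs. This sketch assumes `jq` is installed and that each record carries an `instance_id` field matching the IDs used in the evaluation example above:
```bash
# Path from the example run above; adjust to your own output directory.
OUTPUT=evaluation/evaluation_outputs/outputs/cmu-lti__interactive-swe-test/CodeActAgent/claude-3-7-sonnet-20250219_maxiter_100_N_v0.39.0-no-hint-run_1/output.jsonl

wc -l < "$OUTPUT"               # number of instances processed
jq -r '.instance_id' "$OUTPUT"  # instance IDs (assumes an `instance_id` field per record)
```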