# Evaluation

This folder contains code and resources to run experiments and evaluations.

## For Benchmark Users

### Setup

Before starting evaluation, follow the instructions [here](https://github.com/All-Hands-AI/OpenHands/blob/main/Development.md) to set up your local development environment and LLM.

Once setup is complete, follow the benchmark-specific instructions in each subdirectory of the [evaluation directory](#supported-benchmarks). Generally, these involve running `run_infer.py` to perform inference with the agents.
### Implementing and Evaluating an Agent

To add an agent to OpenHands, implement it in the [agenthub directory](https://github.com/All-Hands-AI/OpenHands/tree/main/openhands/agenthub). The README there has more information.

To evaluate an agent, pass the agent's name to the `run_infer.py` program (or to the benchmark's wrapper script), as shown in the example below.
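For example, a small SWE-Bench run with `CodeActAgent` might look like the following. This is an illustrative sketch only: the argument order follows the SWE-Bench `run_infer.sh` command shown later in this document, the `llm.eval_gpt4_1106_preview_llm` config is defined in the next section, and other benchmarks take different arguments, so check each benchmark's README.

```bash
# Illustrative invocation; argument order follows the SWE-Bench script:
# <llm config> <git commit/branch> <agent class> <eval limit> <max iterations> <num workers> <dataset> <split>
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
  llm.eval_gpt4_1106_preview_llm HEAD CodeActAgent 10 30 1 princeton-nlp/SWE-bench_Lite test
```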
### Evaluating Different LLMs

In development mode, OpenHands uses `config.toml` for most of its configuration.

**IMPORTANT: For evaluation, only the LLM section in `config.toml` is used. Other settings, such as `save_trajectory_path`, are not applied during evaluation.**

Here's an example configuration file you can use to define and use multiple LLMs:
```toml
[llm]
# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
model = "gpt-4o-2024-05-13"
api_key = "sk-XXX"

[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```
### Configuring Condensers for Evaluation

For benchmarks that support condenser configuration (such as SWE-Bench), you can define multiple condenser configurations in your `config.toml` file. A condenser manages the conversation history so the agent keeps relevant context while staying within token limits; you can learn more about how it works [here](https://www.all-hands.dev/blog/openhands-context-condensensation-for-more-efficient-ai-agents):
```toml
# LLM-based summarizing condenser for evaluation
[condenser.summarizer_for_eval]
type = "llm"
llm_config = "haiku"  # Reference to an LLM config to use for summarization
keep_first = 2        # Number of initial events to always keep
max_size = 100        # Maximum size of history before triggering summarization

# Recent-events condenser for evaluation
[condenser.recent_for_eval]
type = "recent"
keep_first = 2        # Number of initial events to always keep
max_events = 50       # Maximum number of events to keep in history
```
You can then specify which condenser configuration to use when running evaluation scripts, for example:

```bash
EVAL_CONDENSER=summarizer_for_eval \
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.eval_gpt4_1106_preview_llm HEAD CodeActAgent 500 100 1 princeton-nlp/SWE-bench_Verified test
```
The condenser name is up to you, but the value of `EVAL_CONDENSER` must match a `[condenser.<name>]` section defined in your `config.toml` file. In this example, `summarizer_for_eval` refers to the LLM-based summarizing condenser defined above.

If no condenser configuration is specified, the `noop` condenser is used by default, which keeps the full conversation history.
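If you want to make that default explicit, you can define a no-op condenser alongside the others. This is a minimal sketch that assumes the `noop` type follows the same `[condenser.<name>]` convention as the configurations above; the section name `noop_for_eval` is just an illustrative choice:

```toml
# Keeps the full conversation history (the default behavior)
[condenser.noop_for_eval]
type = "noop"
```

It can then be selected the same way, e.g. `EVAL_CONDENSER=noop_for_eval`.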
Other evaluation-specific settings, such as `save_trajectory_path`, are typically set in the `get_config` function of each benchmark's `run_infer.py`.
## Supported Benchmarks

The OpenHands evaluation harness supports a wide variety of benchmarks across [software engineering](#software-engineering), [web browsing](#web-browsing), [miscellaneous assistance](#misc-assistance), and [real-world](#real-world) tasks.
### Software Engineering

- SWE-Bench: [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench)
- HumanEvalFix: [`evaluation/benchmarks/humanevalfix`](./benchmarks/humanevalfix)
- BIRD: [`evaluation/benchmarks/bird`](./benchmarks/bird)
- BioCoder: [`evaluation/benchmarks/biocoder`](./benchmarks/biocoder)
- ML-Bench: [`evaluation/benchmarks/ml_bench`](./benchmarks/ml_bench)
- APIBench: [`evaluation/benchmarks/gorilla`](./benchmarks/gorilla/)
- ToolQA: [`evaluation/benchmarks/toolqa`](./benchmarks/toolqa/)
- AiderBench: [`evaluation/benchmarks/aider_bench`](./benchmarks/aider_bench/)
- Commit0: [`evaluation/benchmarks/commit0_bench`](./benchmarks/commit0_bench/)
- DiscoveryBench: [`evaluation/benchmarks/discoverybench`](./benchmarks/discoverybench/)
### Web Browsing

- WebArena: [`evaluation/benchmarks/webarena`](./benchmarks/webarena/)
- MiniWoB++: [`evaluation/benchmarks/miniwob`](./benchmarks/miniwob/)
- Browsing Delegation: [`evaluation/benchmarks/browsing_delegation`](./benchmarks/browsing_delegation/)

### Misc. Assistance

- GAIA: [`evaluation/benchmarks/gaia`](./benchmarks/gaia)
- GPQA: [`evaluation/benchmarks/gpqa`](./benchmarks/gpqa)
- AgentBench: [`evaluation/benchmarks/agent_bench`](./benchmarks/agent_bench)
- MINT: [`evaluation/benchmarks/mint`](./benchmarks/mint)
- Entity Deduction Arena (EDA): [`evaluation/benchmarks/EDA`](./benchmarks/EDA)
- ProofWriter: [`evaluation/benchmarks/logic_reasoning`](./benchmarks/logic_reasoning)
- ScienceAgentBench: [`evaluation/benchmarks/scienceagentbench`](./benchmarks/scienceagentbench)

### Real World

- TheAgentCompany: [`evaluation/benchmarks/the_agent_company`](./benchmarks/the_agent_company)
## Result Visualization

Check [this Hugging Face space](https://huggingface.co/spaces/OpenHands/evaluation) for a visualization of existing experimental results.

You can also fork [our Hugging Face evaluation outputs repo](https://huggingface.co/spaces/OpenHands/evaluation) and submit your evaluation results to our hosted repo via a pull request, following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## For Benchmark Developers

To learn more about how to integrate your benchmark into OpenHands, check out the [tutorial here](https://docs.all-hands.dev/usage/how-to/evaluation-harness). Briefly:

- Each subfolder contains a specific benchmark or experiment. For example, [`evaluation/benchmarks/swe_bench`](./benchmarks/swe_bench) should contain all the preprocessing/evaluation/analysis scripts (see the sketch of a typical layout after this list).
- Raw data and experimental records should not be stored within this repo.
- Model outputs should be stored in [this Hugging Face space](https://huggingface.co/spaces/OpenHands/evaluation) for visualization.
- Important data files of manageable size and analysis scripts (e.g., Jupyter notebooks) can be uploaded directly to this repo.
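As a rough guide, a new benchmark directory usually mirrors the existing ones such as `swe_bench`. The layout below is only a sketch based on the benchmarks listed above, with a hypothetical benchmark name; the exact files vary per benchmark:

```
evaluation/benchmarks/your_benchmark/   # hypothetical benchmark name
├── README.md          # setup and usage instructions for the benchmark
├── run_infer.py       # inference entry point (builds the config, runs the agent)
└── scripts/
    └── run_infer.sh   # wrapper that passes the LLM config, agent, and dataset arguments
```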