# TestGenEval Benchmark Evaluation
This folder contains the evaluation harness for the TestGenEval benchmark, adapted from the original TestGenEval benchmark ([paper](https://arxiv.org/abs/2410.00752)). TestGenEval is designed to evaluate the ability of language models to generate unit tests for Python code.
## Setup Environment and LLM Configuration

1. Follow the instructions [here](../../README.md#setup) to set up your local development environment and configure your LLM.
2. Install the TestGenEval dependencies:

```bash
poetry install --with testgeneval
```
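The config group referenced below (e.g. `llm.eval_gpt4_1106_preview`) lives in `config.toml` at the repository root. The snippet below is a minimal sketch, assuming the standard OpenHands `[llm.<group>]` layout; the model name and API key are placeholders you should replace with your own.

```bash
# Minimal sketch: append an LLM config group to config.toml (placeholder values).
cat >> config.toml <<'EOF'
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."       # replace with your API key
temperature = 0.0
EOF
```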
## Run Inference

To generate tests using your model, run the following command:

```bash
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 100 30 1 kjain14/testgenevallite test
```
Parameters:

- `model_config`: The config group name for your LLM settings (e.g., `llm.eval_gpt4_1106_preview`)
- `git-version`: The git commit hash or release tag of OpenHands to evaluate (e.g., `HEAD` or `0.6.2`)
- `agent`: The name of the agent to benchmark (default: `CodeActAgent`)
- `eval_limit`: Limit the evaluation to the first N instances (optional)
- `max_iter`: Maximum number of iterations for the agent to run (default: 30)
- `num_workers`: Number of parallel workers for evaluation (default: 1)
- `dataset`: HuggingFace dataset name (default: `kjain14/testgenevallite`)
- `dataset_split`: Dataset split to use (default: `test`)
After running the inference, you will obtain an `output.jsonl` file (by default saved to `evaluation/evaluation_outputs`).
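Before moving on to evaluation, it can help to sanity-check the inference output. The commands below are a sketch: the exact run directory name depends on your model and settings, and the `instance_id` field name is assumed from the usual OpenHands output format, so adjust both to what you actually see.

```bash
# Sketch: sanity-check the inference output (requires jq; adjust the path to your run).
OUTPUT=evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/<your-run-name>/output.jsonl
wc -l "$OUTPUT"                          # number of instances processed
jq -r '.instance_id' "$OUTPUT" | head    # spot-check which instances were run (field name assumed)
```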
## Evaluate Generated Tests

To evaluate the generated tests, use the `eval_infer.sh` script:

```bash
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL [instance_id] [dataset_name] [split] [num_workers] [skip_mutation]

# Example
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
Optional arguments (an example invocation follows this list):

- `instance_id`: Evaluate a single instance only
- `dataset_name`: Name of the dataset to use (default: `kjain14/testgenevallite`)
- `split`: Dataset split to use (default: `test`)
- `num_workers`: Number of parallel workers for running Docker (default: 1)
- `skip_mutation`: Skip mutation testing (pass `true` to skip)
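The arguments appear to be positional, so earlier ones need to be supplied when setting a later one. A sketch of evaluating a single instance with mutation testing skipped (the instance id is a hypothetical placeholder):

```bash
# Sketch: evaluate one instance and skip mutation testing (placeholder instance id).
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh \
  evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl \
  <instance_id> kjain14/testgenevallite test 1 true
```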
The evaluation results will be saved to `evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/`, with `output.testgeneval.jsonl` containing the metrics.
## Metrics

The TestGenEval benchmark evaluates generated tests on the following metrics (a sketch for aggregating them follows this list):

1. Correctness: Measures whether the generated tests are syntactically valid and run without errors.
2. Coverage: Measures the code coverage achieved by the generated tests.
3. Mutation Score: Evaluates how effectively the tests detect intentionally introduced bugs (mutations).
4. Readability: Analyzes the readability of the generated tests using various metrics.
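The per-instance metrics end up in `output.testgeneval.jsonl`. The snippet below is a sketch for inspecting and aggregating them; the `coverage` field name is an assumption, so list the actual keys first and adjust the aggregation accordingly.

```bash
# Sketch: inspect and aggregate per-instance metrics (requires jq; field names may differ).
RESULTS=evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.testgeneval.jsonl
jq -r 'keys[]' "$RESULTS" | sort -u                          # list the metric fields actually present
jq -s '[.[].coverage // empty] | add / length' "$RESULTS"    # mean coverage, assuming a top-level "coverage" field
```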
## Submit Your Evaluation Results

To contribute your evaluation results:

1. Fork [our HuggingFace evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation).
2. Add your results to the forked repository.
3. Submit a Pull Request with your evaluation results, following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## Additional Resources

- [TestGenEval Paper](https://arxiv.org/abs/2410.00752)
- [OpenHands Repository](https://github.com/All-Hands-AI/OpenHands)
- [HuggingFace Datasets](https://huggingface.co/datasets)

For any questions or issues, please open an issue in the [OpenHands repository](https://github.com/All-Hands-AI/OpenHands/issues).