# TestGenEval Benchmark Evaluation
This folder contains the evaluation harness for the TestGenEval benchmark, which is based on the original TestGenEval benchmark (paper). TestGenEval is designed to evaluate the ability of language models to generate unit tests for given Python functions.
## Setup Environment and LLM Configuration
Follow the instructions here to set up your local development environment and configure your LLM.
Install the TestGenEval dependencies:

```bash
poetry install --with testgeneval
```
## Run Inference
To generate tests using your model, run the following command:
```bash
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 100 30 1 kjain14/testgenevallite test
```
Parameters:

- `model_config`: The config group name for your LLM settings (e.g., `eval_gpt4_1106_preview`)
- `git-version`: The git commit hash or release tag of OpenHands to evaluate (e.g., `HEAD` or `0.6.2`)
- `agent`: The name of the agent for benchmarks (default: `CodeActAgent`)
- `eval_limit`: Limit the evaluation to the first N instances (optional)
- `max_iter`: Maximum number of iterations for the agent to run (default: 30)
- `num_workers`: Number of parallel workers for evaluation (default: 1)
- `dataset`: HuggingFace dataset name (default: `kjain14/testgenevallite`); see the snippet below for a quick way to preview it
- `dataset_split`: Dataset split to use (default: `test`)
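If you want a quick look at the benchmark instances before running inference, you can load the dataset with the `datasets` library. This is an optional sketch: it only prints the dataset size and column names, since the exact schema is not documented here.

```python
from datasets import load_dataset

# Preview the default TestGenEval lite dataset used by run_infer.sh.
ds = load_dataset("kjain14/testgenevallite", split="test")

print(f"{len(ds)} instances")
print("Columns:", ds.column_names)
```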
After running the inference, you will obtain an `output.jsonl` file (by default saved under `evaluation/evaluation_outputs`).
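Each line of `output.jsonl` is a JSON record for one instance. A minimal sketch for sanity-checking the file before evaluation is shown below; the path mirrors the example used later in this README, so adjust it to match your own run. The record schema is not documented here, so the snippet only counts records and prints the keys of the first one.

```python
import json

# Path from the example evaluation below; adjust to match your own run.
OUTPUT = (
    "evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/"
    "CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl"
)

with open(OUTPUT) as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} instances in output.jsonl")
print("Keys in the first record:", sorted(records[0].keys()))
```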
## Evaluate Generated Tests
To evaluate the generated tests, use the `eval_infer.sh` script:

```bash
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL [instance_id] [dataset_name] [split] [num_workers] [skip_mutation]

# Example
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
```
Optional arguments:

- `instance_id`: Evaluate a single instance (optional)
- `dataset_name`: Name of the dataset to use (default: `kjain14/testgenevallite`)
- `split`: Dataset split to use (default: `test`)
- `num_workers`: Number of workers for running Docker (default: 1)
- `skip_mutation`: Skip mutation testing (pass `true` to skip it)
The evaluation results will be saved to `evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/`, with `output.testgeneval.jsonl` containing the metrics.
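Like the inference output, the metrics file is JSONL with one record per instance. If you want a quick aggregate, the sketch below averages a numeric metric across instances; the key name `coverage` is an assumption, so check the keys in your own file (for example with the inspection snippet above) and substitute the real one.

```python
import json

RESULTS = (
    "evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/"
    "CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.testgeneval.jsonl"
)

with open(RESULTS) as f:
    records = [json.loads(line) for line in f if line.strip()]

# "coverage" is an assumed key name; replace it with the actual metric key
# present in your output.testgeneval.jsonl.
values = [r["coverage"] for r in records if isinstance(r.get("coverage"), (int, float))]
if values:
    print(f"Mean coverage over {len(values)} instances: {sum(values) / len(values):.2f}")
else:
    print("No 'coverage' key found; inspect the record keys to find the metric names.")
```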
## Metrics
The TestGenEval benchmark evaluates generated tests based on the following metrics:
- Correctness: Measures if the generated tests are syntactically correct and run without errors.
- Coverage: Assesses the code coverage achieved by the generated tests.
- Mutation Score: Evaluates the effectiveness of the tests in detecting intentionally introduced bugs (mutations).
- Readability: Analyzes the readability of the generated tests using various metrics.
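For reference, mutation score is conventionally defined as the fraction of injected mutants that the test suite detects (kills); the harness may report it in its own format, so the snippet below only illustrates the standard formula.

```python
def mutation_score(killed: int, total: int) -> float:
    """Standard definition: killed mutants divided by total injected mutants."""
    return killed / total if total else 0.0

# Example: if 42 of 60 injected mutants are detected, the mutation score is 0.70.
print(f"{mutation_score(42, 60):.2f}")
```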
## Submit Your Evaluation Results
To contribute your evaluation results:
- Fork our HuggingFace evaluation outputs.
- Add your results to the forked repository.
- Submit a Pull Request with your evaluation results following the guide here.
## Additional Resources
For any questions or issues, please open an issue in the OpenHands repository.