TestGenEval Benchmark Evaluation

This folder contains the evaluation harness for TestGenEval, based on the original TestGenEval benchmark (paper). TestGenEval is designed to evaluate the ability of language models to generate unit tests for given Python functions.
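
For illustration only: a benchmark instance supplies Python source code, and the model under evaluation is expected to produce a test suite for it. The hypothetical snippet below shows the general shape of such a generated test; it targets the standard library's statistics.median purely to keep the example self-contained, whereas real instances target code from the dataset's repositories.

# Hypothetical example of the kind of unit test a model is expected to generate.
# statistics.median is used only to keep the example runnable on its own.
import statistics
import unittest

class TestMedian(unittest.TestCase):
    def test_odd_length(self):
        self.assertEqual(statistics.median([3, 1, 2]), 2)

    def test_even_length(self):
        self.assertEqual(statistics.median([1, 2, 3, 4]), 2.5)

if __name__ == "__main__":
    unittest.main()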

Setup Environment and LLM Configuration

  1. Follow the instructions here to set up your local development environment and configure your LLM.

  2. Install the TestGenEval dependencies:

poetry install --with testgeneval

Run Inference

To generate tests using your model, run the following command:

./evaluation/benchmarks/testgeneval/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/testgeneval/scripts/run_infer.sh llm.eval_gpt4_1106_preview HEAD CodeActAgent 100 30 1 kjain14/testgenevallite test

Parameters:

  • model_config: The config group name for your LLM settings (e.g., eval_gpt4_1106_preview)
  • git-version: The git commit hash or release tag of OpenHands to evaluate (e.g., HEAD or 0.6.2)
  • agent: The agent to use for the benchmark (default: CodeActAgent)
  • eval_limit: Limit the evaluation to the first N instances (optional)
  • max_iter: Maximum number of iterations for the agent to run (default: 30)
  • num_workers: Number of parallel workers for evaluation (default: 1)
  • dataset: HuggingFace dataset name (default: kjain14/testgenevallite)
  • dataset_split: Dataset split to use (default: test)

After running the inference, you will obtain an output.jsonl file (by default saved to evaluation/evaluation_outputs).
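
output.jsonl is a JSON Lines file with one record per evaluated instance. A quick sanity check after a run is to count the records, for example with a short script like the sketch below (it assumes only the JSON Lines structure, not any particular field names).

# Minimal sketch: count the instances produced by an inference run.
# Assumes only that output.jsonl is JSON Lines (one JSON object per line);
# adjust the path to your own run's output directory.
import json

output_path = "evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl"

with open(output_path) as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} instances in {output_path}")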

Evaluate Generated Tests

To evaluate the generated tests, use the eval_infer.sh script:

./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh $YOUR_OUTPUT_JSONL [instance_id] [dataset_name] [split] [num_workers] [skip_mutation]

# Example
./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl

Optional arguments:

  • instance_id: Evaluate a single instance (optional)
  • dataset_name: Name of the dataset to use (default: kjain14/testgenevallite)
  • split: Dataset split to use (default: test)
  • num_workers: Number of workers for running docker (default: 1)
  • skip_mutation: Skip mutation testing (pass true to skip)

The evaluation results will be saved to evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/, with output.testgeneval.jsonl containing the metrics.
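
Like the inference output, output.testgeneval.jsonl is a JSON Lines report with one record per instance. The sketch below shows one way to aggregate a per-instance metric across the report; note that the "coverage" field name used here is an assumption for illustration, so inspect a record from your own report to confirm the actual schema.

# Sketch: average a per-instance metric from the evaluation report.
# Assumption: each JSON Lines record carries a numeric "coverage" field;
# the real field names may differ, so check your report's schema first.
import json

report_path = "evaluation/evaluation_outputs/outputs/kjain14__testgenevallite-test/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.testgeneval.jsonl"

values = []
with open(report_path) as f:
    for line in f:
        record = json.loads(line)
        if isinstance(record.get("coverage"), (int, float)):
            values.append(record["coverage"])

if values:
    print(f"instances with coverage: {len(values)}")
    print(f"mean coverage: {sum(values) / len(values):.2f}")
else:
    print("no 'coverage' field found; inspect the report schema")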

Metrics

The TestGenEval benchmark evaluates generated tests based on the following metrics:

  1. Correctness: Measures whether the generated tests are syntactically valid and run without errors.
  2. Coverage: Assesses the code coverage achieved by the generated tests.
  3. Mutation Score: Evaluates how effectively the tests detect intentionally introduced bugs (mutations); the conventional formulas for coverage and mutation score are sketched after this list.
  4. Readability: Analyzes the readability of the generated tests using various metrics.
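
For reference, the sketch below gives the conventional definitions behind the coverage and mutation score metrics; the exact formulas and granularity used by the harness may differ.

# Conventional textbook formulas for two of the metrics above; not necessarily
# the exact implementation used by the TestGenEval harness.

def line_coverage(lines_executed: int, lines_total: int) -> float:
    """Fraction of executable lines exercised by the test suite."""
    return lines_executed / lines_total if lines_total else 0.0

def mutation_score(mutants_killed: int, mutants_total: int) -> float:
    """Fraction of injected mutants detected (killed) by the test suite."""
    return mutants_killed / mutants_total if mutants_total else 0.0

# Example: 180 of 200 lines covered, 45 of 60 mutants killed.
print(f"coverage       = {line_coverage(180, 200):.0%}")   # 90%
print(f"mutation score = {mutation_score(45, 60):.0%}")    # 75%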

Submit Your Evaluation Results

To contribute your evaluation results:

  1. Fork our HuggingFace evaluation outputs repository.
  2. Add your results to the forked repository.
  3. Submit a Pull Request with your evaluation results following the guide here.

Additional Resources

For any questions or issues, please open an issue in the OpenHands repository.