Lighteval documentation
Saving and reading results
Saving and reading results
Saving results locally
Lighteval will automatically save results and evaluation details in the
directory set with the --output-dir
option. The results will be saved in
{output_dir}/results/{model_name}/results_{timestamp}.json
. Here is an
example of a result file. The output path can be
any fsspec
compliant path (local, s3, hf hub, gdrive, ftp, etc).
To save the details of the evaluation, you can use the --save-details
option. The details will be saved in a parquet file
{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet
.
If you want results to be saved in a custom path, you can set the results-path-template
option.
This allows you to set a string template for the path. The template need to contain the following
variables: output_dir
, model_name
, org
. For example
{output_dir}/{org}_{model}
. The template will be used to create the path for the results file.
Pushing results to the HuggingFace hub
You can push the results and evaluation details to the HuggingFace hub. To do
so, you need to set the --push-to-hub
as well as the --results-org
option. The results will be saved in a dataset with the name at
{results_org}/{model_org}/{model_name}
. To push the details, you need to set
the --save-details
option.
The dataset created will be private by default, you can make it public by
setting the --public-run
option.
Pushing results to Tensorboard
You can push the results to Tensorboard by setting --push-to-tensorboard
.
This will create a Tensorboard dashboard in a HF org set with the --results-org
option.
Pushing results to WandB
You can push the results to WandB by setting --wandb
. This will init a WandB
run and log the results.
Wandb args need to be set in your env variables.
export WANDB_PROJECT="lighteval"
You can find a list of variable in the wandb documentation.
How to load and investigate details
Load from local detail files
from datasets import load_dataset
import os
output_dir = "evals_doc"
model_name = "HuggingFaceH4/zephyr-7b-beta"
timestamp = "latest"
task = "lighteval|gsm8k|0"
if timestamp == "latest":
path = f"{output_dir}/details/{model_org}/{model_name}/*/"
timestamps = glob.glob(path)
timestamp = sorted(timestamps)[-1].split("/")[-2]
print(f"Latest timestamp: {timestamp}")
details_path = f"{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"
# Load the details
details = load_dataset("parquet", data_files=details_path, split="train")
for detail in details:
print(detail)
Load from the HuggingFace hub
from datasets import load_dataset
results_org = "SaylorTwift"
model_name = "HuggingFaceH4/zephyr-7b-beta"
sanitized_model_name = model_name.replace("/", "__")
task = "lighteval|gsm8k|0"
public_run = False
dataset_path = f"{results_org}/details_{sanitized_model_name}{'_private' if not public_run else ''}"
details = load_dataset(dataset_path, task.replace("|", "_"), split="latest")
for detail in details:
print(detail)
The detail file contains the following columns:
- doc: The doc used for the evaluation, this will contain the gold reference, the fewshots and other hyperparamters used for the task.
- model_response: where you will find model generations, logprobs and the input that was sent to the model
- metric: the value of the metrics for this sample
Example of a result file
{
"config_general": {
"lighteval_sha": "203045a8431bc9b77245c9998e05fc54509ea07f",
"num_fewshot_seeds": 1,
"override_batch_size": 1,
"max_samples": 1,
"job_id": "",
"start_time": 620979.879320166,
"end_time": 621004.632108041,
"total_evaluation_time_secondes": "24.752787875011563",
"model_name": "gpt2",
"model_sha": "607a30d783dfa663caf39e06633721c8d4cfcd7e",
"model_dtype": null,
"model_size": "476.2 MB"
},
"results": {
"lighteval|gsm8k|0": {
"qem": 0.0,
"qem_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
},
"all": {
"qem": 0.0,
"qem_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
}
},
"versions": {
"lighteval|gsm8k|0": 0
},
"config_tasks": {
"lighteval|gsm8k": {
"name": "gsm8k",
"prompt_function": "gsm8k",
"hf_repo": "gsm8k",
"hf_subset": "main",
"metric": [
{
"metric_name": "qem",
"higher_is_better": true,
"category": "3",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
},
{
"metric_name": "maj@8",
"higher_is_better": true,
"category": "5",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
}
],
"hf_avail_splits": [
"train",
"test"
],
"evaluation_splits": [
"test"
],
"few_shots_split": null,
"few_shots_select": "random_sampling_from_train",
"generation_size": 256,
"generation_grammar": null,
"stop_sequence": [
"Question="
],
"num_samples": null,
"suite": [
"lighteval"
],
"original_num_docs": 1319,
"effective_num_docs": 1,
"trust_dataset": true,
"must_remove_duplicate_docs": null,
"version": 0
}
},
"summary_tasks": {
"lighteval|gsm8k|0": {
"hashes": {
"hash_examples": "8517d5bf7e880086",
"hash_full_prompts": "8517d5bf7e880086",
"hash_input_tokens": "29916e7afe5cb51d",
"hash_cont_tokens": "37f91ce23ef6d435"
},
"truncated": 2,
"non_truncated": 0,
"padded": 0,
"non_padded": 2,
"effective_few_shots": 0.0,
"num_truncated_few_shots": 0
}
},
"summary_general": {
"hashes": {
"hash_examples": "5f383c395f01096e",
"hash_full_prompts": "5f383c395f01096e",
"hash_input_tokens": "ac933feb14f96d7b",
"hash_cont_tokens": "9d03fb26f8da7277"
},
"truncated": 2,
"non_truncated": 0,
"padded": 0,
"non_padded": 2,
"num_truncated_few_shots": 0
}
}