Saving and reading results

Saving results locally

Lighteval will automatically save results and evaluation details in the directory set with the --output-dir option. The results will be saved in {output_dir}/results/{model_name}/results_{timestamp}.json. Here is an example of a result file. The output path can be any fsspec compliant path (local, s3, hf hub, gdrive, ftp, etc).

To save the details of the evaluation, you can use the --save-details option. The details will be saved in a parquet file {output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet.

If you want results to be saved in a custom path, you can set the results-path-template option. This allows you to set a string template for the path. The template need to contain the following variables: output_dir, model_name, org. For example {output_dir}/{org}_{model}. The template will be used to create the path for the results file.

Pushing results to the HuggingFace hub

You can push the results and evaluation details to the HuggingFace hub. To do so, you need to set the --push-to-hub as well as the --results-org option. The results will be saved in a dataset with the name at {results_org}/{model_org}/{model_name}. To push the details, you need to set the --save-details option. The dataset created will be private by default, you can make it public by setting the --public-run option.

Pushing results to Tensorboard

You can push the results to Tensorboard by setting --push-to-tensorboard. This will create a Tensorboard dashboard in a HF org set with the --results-org option.

Pushing results to WandB

You can push the results to WandB by setting --wandb. This will init a WandB run and log the results.

Wandb args need to be set in your env variables.

export WANDB_PROJECT="lighteval"

You can find a list of variable in the wandb documentation.

How to load and investigate details

Load from local detail files

from datasets import load_dataset
import os

output_dir = "evals_doc"
model_name = "HuggingFaceH4/zephyr-7b-beta"
timestamp = "latest"
task = "lighteval|gsm8k|0"

if timestamp == "latest":
    path = f"{output_dir}/details/{model_org}/{model_name}/*/"
    timestamps = glob.glob(path)
    timestamp = sorted(timestamps)[-1].split("/")[-2]
    print(f"Latest timestamp: {timestamp}")

details_path = f"{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"

# Load the details
details = load_dataset("parquet", data_files=details_path, split="train")

for detail in details:
    print(detail)

Load from the HuggingFace hub

from datasets import load_dataset

results_org = "SaylorTwift"
model_name = "HuggingFaceH4/zephyr-7b-beta"
sanitized_model_name = model_name.replace("/", "__")
task = "lighteval|gsm8k|0"
public_run = False

dataset_path = f"{results_org}/details_{sanitized_model_name}{'_private' if not public_run else ''}"
details = load_dataset(dataset_path, task.replace("|", "_"), split="latest")

for detail in details:
    print(detail)

The detail file contains the following columns:

doc: The doc used for the evaluation, this will contain the gold reference, the fewshots and other hyperparamters used for the task.
model_response: where you will find model generations, logprobs and the input that was sent to the model
metric: the value of the metrics for this sample

Example of a result file

{
  "config_general": {
    "lighteval_sha": "203045a8431bc9b77245c9998e05fc54509ea07f",
    "num_fewshot_seeds": 1,
    "override_batch_size": 1,
    "max_samples": 1,
    "job_id": "",
    "start_time": 620979.879320166,
    "end_time": 621004.632108041,
    "total_evaluation_time_secondes": "24.752787875011563",
    "model_name": "gpt2",
    "model_sha": "607a30d783dfa663caf39e06633721c8d4cfcd7e",
    "model_dtype": null,
    "model_size": "476.2 MB"
  },
  "results": {
    "lighteval|gsm8k|0": {
      "qem": 0.0,
      "qem_stderr": 0.0,
      "maj@8": 0.0,
      "maj@8_stderr": 0.0
    },
    "all": {
      "qem": 0.0,
      "qem_stderr": 0.0,
      "maj@8": 0.0,
      "maj@8_stderr": 0.0
    }
  },
  "versions": {
    "lighteval|gsm8k|0": 0
  },
  "config_tasks": {
    "lighteval|gsm8k": {
      "name": "gsm8k",
      "prompt_function": "gsm8k",
      "hf_repo": "gsm8k",
      "hf_subset": "main",
      "metric": [
        {
          "metric_name": "qem",
          "higher_is_better": true,
          "category": "3",
          "use_case": "5",
          "sample_level_fn": "compute",
          "corpus_level_fn": "mean"
        },
        {
          "metric_name": "maj@8",
          "higher_is_better": true,
          "category": "5",
          "use_case": "5",
          "sample_level_fn": "compute",
          "corpus_level_fn": "mean"
        }
      ],
      "hf_avail_splits": [
        "train",
        "test"
      ],
      "evaluation_splits": [
        "test"
      ],
      "few_shots_split": null,
      "few_shots_select": "random_sampling_from_train",
      "generation_size": 256,
      "generation_grammar": null,
      "stop_sequence": [
        "Question="
      ],
      "num_samples": null,
      "suite": [
        "lighteval"
      ],
      "original_num_docs": 1319,
      "effective_num_docs": 1,
      "trust_dataset": true,
      "must_remove_duplicate_docs": null,
      "version": 0
    }
  },
  "summary_tasks": {
    "lighteval|gsm8k|0": {
      "hashes": {
        "hash_examples": "8517d5bf7e880086",
        "hash_full_prompts": "8517d5bf7e880086",
        "hash_input_tokens": "29916e7afe5cb51d",
        "hash_cont_tokens": "37f91ce23ef6d435"
      },
      "truncated": 2,
      "non_truncated": 0,
      "padded": 0,
      "non_padded": 2,
      "effective_few_shots": 0.0,
      "num_truncated_few_shots": 0
    }
  },
  "summary_general": {
    "hashes": {
      "hash_examples": "5f383c395f01096e",
      "hash_full_prompts": "5f383c395f01096e",
      "hash_input_tokens": "ac933feb14f96d7b",
      "hash_cont_tokens": "9d03fb26f8da7277"
    },
    "truncated": 2,
    "non_truncated": 0,
    "padded": 0,
    "non_padded": 2,
    "num_truncated_few_shots": 0
  }
}

< > Update on GitHub