Evaluate documentation




The evaluator classes for automatic evaluation.

Evaluator classes

The main entry point for using the evaluator:


evaluate.evaluator

( task: str = None ) → Evaluator

Returns: Evaluator, an evaluator suitable for the task.

Utility factory method to build an Evaluator. Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.


>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")

The base class for all evaluator classes:

class evaluate.Evaluator

( task: str, default_metric_name: str = None )

Base class implementing evaluator operations. All evaluators inherit from the Evaluator class; refer to it for the methods shared across the different evaluators.


check_required_columns

( data: typing.Union[str, datasets.arrow_dataset.Dataset], columns_names: typing.Dict[str, str] )


  • data (str or Dataset): Specifies the dataset we will run evaluation on.
  • columns_names (Dict[str, str]): Mapping of the columns to check in the dataset. The keys are the arguments to the compute() method, while the values are the column names to check.

Ensure the columns required for the evaluation are present in the dataset.
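The check described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper name `check_columns_sketch` and the use of a plain list of column names in place of a Dataset are assumptions made for the example.

```python
from typing import Dict, Iterable


def check_columns_sketch(available_columns: Iterable[str], columns_names: Dict[str, str]) -> None:
    """Raise ValueError if any required column is missing.

    `available_columns` stands in for the dataset's column names (hypothetical).
    The keys of `columns_names` are compute() argument names, the values are
    the column names those arguments point to.
    """
    available = set(available_columns)
    for argument_name, column_name in columns_names.items():
        if column_name not in available:
            raise ValueError(
                f"Invalid `{argument_name}` {column_name!r}: "
                f"available columns are {sorted(available)}."
            )


# Passes silently because both columns exist.
check_columns_sketch(
    ["text", "label"],
    {"input_column": "text", "label_column": "label"},
)
```

Reporting the argument name alongside the column name makes it easier to see which compute() keyword needs to change.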


compute_metric

( metric: EvaluationModule, metric_inputs: typing.Dict, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, random_state: typing.Optional[int] = None )

Compute and return metrics.
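To make the difference between the two strategies concrete, here is a minimal sketch of a percentile bootstrap around an accuracy score, written with the standard library only. It is an illustration under stated assumptions: the function names, the percentile method, and the shape of the returned dictionary are chosen for the example and are not taken from the library.

```python
import random
from typing import Callable, Dict, Optional, Sequence


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of positions where prediction and reference agree."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def bootstrap_sketch(
    metric_fn: Callable[[Sequence[int], Sequence[int]], float],
    predictions: Sequence[int],
    references: Sequence[int],
    confidence_level: float = 0.95,
    n_resamples: int = 999,
    random_state: Optional[int] = None,
) -> Dict[str, object]:
    """Point score plus a percentile confidence interval (hypothetical helper)."""
    rng = random.Random(random_state)  # a fixed seed makes the run reproducible
    n = len(references)
    scores = []
    for _ in range(n_resamples):
        # Resample (prediction, reference) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric_fn([predictions[i] for i in idx], [references[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return {"score": metric_fn(predictions, references), "confidence_interval": (low, high)}


preds = [1, 0, 1, 1, 0, 1, 0, 0]
refs = [1, 0, 0, 1, 0, 1, 1, 0]
result = bootstrap_sketch(accuracy, preds, refs, n_resamples=200, random_state=0)
```

With the "simple" strategy only the point score would be computed; the resampling loop is the extra cost that `n_resamples` controls.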


get_dataset_split

( data, subset = None, split = None ) → split


  • data (str): Name of the dataset.
  • subset (str): Name of the config for datasets with multiple configurations (e.g. 'glue/cola').
  • split (str, defaults to None): Split to use.

Returns: split (str), the name of the split to use.

Infers which split to use if None is given.
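The inference step can be pictured as walking a preference list and returning the first split the dataset actually provides. The sketch below assumes a preference order of test, then validation, then train; that order, the constant, and the helper name are assumptions for illustration and may differ from what the library does.

```python
from typing import Optional, Sequence

# Assumed preference order, for illustration only.
PREFERRED_SPLIT_ORDER = ("test", "validation", "train")


def choose_split_sketch(available_splits: Sequence[str], split: Optional[str] = None) -> str:
    """Return `split` if given, otherwise the first preferred split that exists."""
    if split is not None:
        return split  # an explicit user choice always wins
    for candidate in PREFERRED_SPLIT_ORDER:
        if candidate in available_splits:
            return candidate
    raise ValueError("No dataset split found; pass an explicit value for `split`.")


chosen = choose_split_sketch(["train", "validation"])
```

Passing `split` explicitly avoids any ambiguity, in particular for slice-style values such as `test[:100]`.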


load_data

( data: typing.Union[str, datasets.arrow_dataset.Dataset], subset: str = None, split: str = None ) → data (Dataset)


  • data (Dataset or str, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None): Specifies the dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).
  • split (str, defaults to None): User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is of type str, the best split is selected automatically via choose_split().

Returns: data (Dataset), the loaded dataset which will be used for evaluation.

Load dataset with given subset and split.


predictions_processor

( *args, **kwargs )

A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
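As an illustration of what such processing can look like for a classification task, the sketch below converts pipeline-style outputs (a list of dictionaries with `label` and `score` keys) into a `predictions` list that a metric can consume. The helper name and the optional `label_mapping` handling are assumptions for the example; each evaluator defines its own processing.

```python
from typing import Dict, List, Optional, Union

Number = Union[int, float]


def process_predictions_sketch(
    pipeline_outputs: List[Dict[str, object]],
    label_mapping: Optional[Dict[str, Number]] = None,
) -> Dict[str, list]:
    """Keep only the predicted labels, optionally mapped to numeric ids."""
    labels = [output["label"] for output in pipeline_outputs]
    if label_mapping is not None:
        # Translate pipeline label strings into the ids used by the dataset.
        labels = [label_mapping[label] for label in labels]
    return {"predictions": labels}


outputs = [
    {"label": "LABEL_1", "score": 0.91},
    {"label": "LABEL_0", "score": 0.77},
]
metric_inputs = process_predictions_sketch(outputs, {"LABEL_0": 0, "LABEL_1": 1})
```

The returned dictionary can then be merged with the references prepared from the dataset before the metric is computed.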


prepare_data

( data: Dataset, input_column: str, label_column: str, *args, **kwargs ) → dict


  • data (Dataset): Specifies the dataset we will run evaluation on.
  • input_column (str, defaults to "text"): The name of the column containing the text feature in the dataset specified by data.
  • label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.

Returns: dict: metric inputs. list: pipeline inputs.

Prepare data.


prepare_metric

( metric: typing.Union[str, evaluate.module.EvaluationModule] )


  • metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

Prepare metric.


prepare_pipeline

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')], tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, device: int = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task. If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None): Can be used to overwrite the default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

Prepare pipeline.

The task-specific evaluators


class evaluate.ImageClassificationEvaluator

( task = 'image-classification', default_metric_name = None )

Image classification evaluator. This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification. Methods in this class assume a data format compatible with the ImageClassificationPipeline.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'image', label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are "simple" (compute the metric and return the scores) and "bootstrap" (additionally compute a confidence interval for each returned metric key using bootstrap resampling).
  • confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
  • device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.


>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )


class evaluate.QuestionAnsweringEvaluator

( task = 'question-answering', default_metric_name = None )

Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.

This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.

Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, question_column: str = 'question', context_column: str = 'context', id_column: str = 'id', label_column: str = 'answers', squad_v2_format: typing.Optional[bool] = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are "simple" (compute the metric and return the scores) and "bootstrap" (additionally compute a confidence interval for each returned metric key using bootstrap resampling).
  • confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
  • device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.


>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )

Datasets where the answer may be missing from the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
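To show why the format flag matters, the sketch below builds SQuAD-style metric inputs from question-answering pipeline outputs. The `squad` metric expects predictions with `id` and `prediction_text`, and `squad_v2` additionally expects `no_answer_probability`. The helper name is hypothetical, and deriving the probability from an empty answer string is a simplifying assumption made for this example, not a description of how the evaluator computes it.

```python
from typing import Dict, List


def to_squad_inputs_sketch(
    ids: List[str],
    pipeline_outputs: List[Dict[str, object]],
    answers: List[Dict[str, list]],
    squad_v2_format: bool = False,
) -> Dict[str, list]:
    """Build `predictions` and `references` lists in SQuAD metric format."""
    predictions = []
    for example_id, output in zip(ids, pipeline_outputs):
        prediction = {"id": example_id, "prediction_text": output["answer"]}
        if squad_v2_format:
            # Simplifying assumption: an empty answer means "no answer".
            prediction["no_answer_probability"] = 1.0 if output["answer"] == "" else 0.0
        predictions.append(prediction)
    references = [
        {"id": example_id, "answers": answer}
        for example_id, answer in zip(ids, answers)
    ]
    return {"predictions": predictions, "references": references}


inputs_v2 = to_squad_inputs_sketch(
    ids=["q1", "q2"],
    pipeline_outputs=[{"answer": "Paris", "score": 0.98}, {"answer": "", "score": 0.60}],
    answers=[
        {"text": ["Paris"], "answer_start": [10]},
        {"text": [], "answer_start": []},  # unanswerable question
    ],
    squad_v2_format=True,
)
```

Without the flag, the extra key is omitted and the inputs match what the `squad` metric expects.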


class evaluate.TextClassificationEvaluator

( task = 'text-classification', default_metric_name = None )

Text classification evaluator. This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias. Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', second_input_column: typing.Optional[str] = None, label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias, sentiment-analysis). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are "simple" (compute the metric and return the scores) and "bootstrap" (additionally compute a confidence interval for each returned metric key using bootstrap resampling).
  • confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
  • device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.


>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )


class evaluate.TokenClassificationEvaluator

( task = 'token-classification', default_metric_name = None )

Token classification evaluator.

This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.

Methods in this class assume a data format compatible with the TokenClassificationPipeline.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: str = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: typing.Optional[int] = None, random_state: typing.Optional[int] = None, input_column: str = 'tokens', label_column: str = 'ner_tags', join_by: typing.Optional[str] = ' ' )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are "simple" (compute the metric and return the scores) and "bootstrap" (additionally compute a confidence interval for each returned metric key using bootstrap resampling).
  • confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
  • device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.

The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.


>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )

For example, the following dataset format is accepted by the evaluator:

from datasets import ClassLabel, Dataset, Features, Sequence, Value
dataset = Dataset.from_dict(
    {"tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
     "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]]},
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)

For example, the following dataset format is not accepted by the evaluator:

from datasets import Dataset, Features, Sequence, Value
dataset = Dataset.from_dict(
    {
        # A single string per example, with character offsets for each entity.
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
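The role of the word-list format and of `join_by` can be sketched as follows: the tokens are joined into one string for the pipeline, while the integer tags are translated into label names for the metric. The helper name is hypothetical and the code is a simplified illustration, not the evaluator's implementation.

```python
from typing import List, Tuple


def prepare_token_example_sketch(
    tokens: List[str],
    ner_tags: List[int],
    label_names: List[str],
    join_by: str = " ",
) -> Tuple[str, List[str]]:
    """Join words into pipeline input text and map tag ids to label names."""
    if len(tokens) != len(ner_tags):
        raise ValueError("Each token needs exactly one tag.")
    text = join_by.join(tokens)
    labels = [label_names[tag] for tag in ner_tags]
    return text, labels


text, labels = prepare_token_example_sketch(
    tokens=["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."],
    ner_tags=[1, 2, 0, 0, 0, 0, 3, 0, 0, 0],
    label_names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"],
)
```

Because every word carries exactly one tag, predictions can be compared with references word by word; an offset-based dataset provides no such alignment, which is why that format is rejected.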


class evaluate.TextGenerationEvaluator

( task = 'text-generation', default_metric_name = None, predictions_prefix: str = 'generated' )

Text generation evaluator. This text generation evaluator can currently be loaded from evaluator() using the default task name text-generation. Methods in this class assume a data format compatible with the TextGenerationPipeline.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )


class evaluate.Text2TextGenerationEvaluator

( task = 'text2text-generation', default_metric_name = None )

Text2Text generation evaluator. This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name text2text-generation. Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are "simple" (compute the metric and return the scores) and "bootstrap" (additionally compute a confidence interval for each returned metric key using bootstrap resampling).
  • confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
  • device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
  • input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
  • label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
  • generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.

Compute the metric for a given pipeline and dataset combination.


class evaluate.SummarizationEvaluator

( task = 'summarization', default_metric_name = None )

Text summarization evaluator. This text summarization evaluator can currently be loaded from evaluator() using the default task name summarization. Methods in this class assume a data format compatible with the SummarizationPipeline.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are "simple" (compute the metric and return the scores) and "bootstrap" (additionally compute a confidence interval for each returned metric key using bootstrap resampling).
  • confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
  • device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a positive integer runs the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
  • input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
  • label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
  • generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.

Compute the metric for a given pipeline and dataset combination.


class evaluate.TranslationEvaluator

( task = 'translation', default_metric_name = None )

Translation evaluator. This translation evaluator can currently be loaded from evaluator() using the default task name translation. Methods in this class assume a data format compatible with the TranslationPipeline.


< >

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = Nonedata: typing.Union[str, datasets.arrow_dataset.Dataset] = Nonesubset: typing.Optional[str] = Nonesplit: typing.Optional[str] = Nonemetric: typing.Union[str, evaluate.module.EvaluationModule] = Nonetokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = Nonestrategy: typing.Literal['simple', 'bootstrap'] = 'simple'confidence_level: float = 0.95n_resamples: int = 9999device: int = Nonerandom_state: typing.Optional[int] = Noneinput_column: str = 'text'label_column: str = 'label'generation_kwargs: dict = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) - If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None) - Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None) - Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None) - Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None) - Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None) - Can be used to override the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple") - Specifies the evaluation strategy. Possible values are "simple" and "bootstrap". With "bootstrap", the confidence_level, n_resamples and random_state arguments below are passed to bootstrap.
  • confidence_level (float, defaults to 0.95) - The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999) - The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
  • device (int, defaults to None) - Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a non-negative integer runs the model on the CUDA device with that ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None) - The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
  • input_column (str, defaults to "text") - The name of the column containing the input text in the dataset specified by data.
  • label_column (str, defaults to "label") - The name of the column containing the labels in the dataset specified by data.
  • generation_kwargs (Dict, optional, defaults to None) - The generation kwargs are passed to the pipeline and set the text generation strategy.

Compute the metric for a given pipeline and dataset combination.
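
As a rough illustration of how these arguments fit together, the sketch below prepares flat "text" and "label" columns and passes them to the translation evaluator. Several things here are assumptions rather than part of this reference: flatten_pairs and evaluate_translation are hypothetical helpers, the input records are assumed to be dicts keyed by language code, and the choice of "bleu" as the metric is only an example. The model name is left to the caller.

```python
from typing import Dict, Iterable, List, Mapping


def flatten_pairs(
    records: Iterable[Mapping[str, str]],
    source_lang: str,
    target_lang: str,
) -> Dict[str, List[str]]:
    """Turn records like {"en": ..., "de": ...} into "text"/"label" columns."""
    columns: Dict[str, List[str]] = {"text": [], "label": []}
    for record in records:
        columns["text"].append(record[source_lang])   # model input
        columns["label"].append(record[target_lang])  # reference translation
    return columns


def evaluate_translation(columns: Dict[str, List[str]], model_name: str, metric: str = "bleu"):
    """Run the translation evaluator on in-memory columns (downloads the model and metric)."""
    # Imported lazily so flatten_pairs can be used without these packages installed.
    from datasets import Dataset
    from evaluate import evaluator

    data = Dataset.from_dict(columns)
    task_evaluator = evaluator("translation")
    return task_evaluator.compute(
        model_or_pipeline=model_name,
        data=data,
        metric=metric,
        input_column="text",    # matches the documented default
        label_column="label",   # matches the documented default
    )


# Hypothetical English-German sentence pairs.
pairs = [{"en": "Hello", "de": "Hallo"}, {"en": "Thanks", "de": "Danke"}]
columns = flatten_pairs(pairs, source_lang="en", target_lang="de")
```

Calling evaluate_translation(columns, model_name) with the identifier of an English-to-German translation model would then return the metric results; the column names are passed explicitly only to show where input_column and label_column come into play.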


class evaluate.AutomaticSpeechRecognitionEvaluator


( task = 'automatic-speech-recognition', default_metric_name = None )

Automatic speech recognition evaluator. This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name automatic-speech-recognition. Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.


compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'path', label_column: str = 'sentence', generation_kwargs: dict = None )


  • model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) - If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
  • data (str or Dataset, defaults to None) - Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
  • subset (str, defaults to None) - Defines which dataset subset to load. If None is passed, the default subset is loaded.
  • split (str, defaults to None) - Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
  • metric (str or EvaluationModule, defaults to None) - Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
  • tokenizer (str or PreTrainedTokenizer, optional, defaults to None) - Can be used to override the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
  • strategy (Literal["simple", "bootstrap"], defaults to "simple") - Specifies the evaluation strategy. Possible values are "simple" and "bootstrap". With "bootstrap", the confidence_level, n_resamples and random_state arguments below are passed to bootstrap.
  • confidence_level (float, defaults to 0.95) - The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
  • n_resamples (int, defaults to 9999) - The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
  • device (int, defaults to None) - Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a non-negative integer runs the model on the CUDA device with that ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
  • random_state (int, optional, defaults to None) - The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.

Compute the metric for a given pipeline and dataset combination.
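
The rule for the device argument described in the parameter list can be written out in a few lines. This is only a sketch of the documented behaviour: resolve_device and its cuda_available flag are hypothetical names that do not exist in evaluate or transformers, and the returned strings are just readable labels.

```python
from typing import Optional


def resolve_device(device: Optional[int], cuda_available: bool) -> str:
    """Map the documented `device` values to a readable device label."""
    if device is None:
        # Inferred: use the first CUDA device if there is one, CPU otherwise.
        return "cuda:0" if cuda_available else "cpu"
    if device == -1:
        # -1 explicitly selects the CPU.
        return "cpu"
    if device >= 0:
        # A non-negative ordinal selects the CUDA device with that ID.
        return f"cuda:{device}"
    raise ValueError(f"Unsupported device ordinal: {device}")
```

For example, resolve_device(None, cuda_available=False) yields "cpu", while resolve_device(1, cuda_available=True) yields "cuda:1".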


>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> # Evaluate on the first 40 examples of the validation split
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="openai/whisper-tiny.en",
...     data=data,
...     input_column="path",
...     label_column="sentence",
...     metric="wer",
... )
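
For readers unfamiliar with the wer metric used in the example above, the sketch below computes a word error rate for a single reference and hypothesis pair: the word-level edit distance divided by the number of reference words. It is an illustrative standard-library implementation, assuming whitespace tokenization; word_error_rate is a hypothetical helper, and the metric loaded by the evaluator may normalize text or aggregate over a dataset differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                       # reference word missing from hypothesis
                    current[j - 1] + 1,                    # extra word in hypothesis
                    previous[j - 1] + substitution_cost,   # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)
```

A hypothesis identical to the reference scores 0.0, and dropping one word from a six-word reference scores 1/6.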