Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes
The main entry point for using the evaluator:
evaluate.evaluator
< source >( task: str = None ) → Evaluator
Parameters
- task (str): The task defining which evaluator will be returned. Currently accepted tasks are:
  - "image-classification": will return an ImageClassificationEvaluator.
  - "question-answering": will return a QuestionAnsweringEvaluator.
  - "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
  - "token-classification": will return a TokenClassificationEvaluator.
Returns
An evaluator suitable for the task.
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets, and metrics for a given task.
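For example, the factory can be used to build and run a text classification evaluator in a few lines (the model, dataset, and label mapping below are illustrative choices, not library defaults):
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> results = task_evaluator.compute(
...     model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
...     data=load_dataset("imdb", split="test[:100]"),
...     metric="accuracy",
...     label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
... )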
The base class for all evaluator classes:
class evaluate.Evaluator
Base class implementing evaluator operations. The Evaluator class is the class from which all evaluators inherit; refer to it for methods shared across the different evaluators.
check_required_columns
< source >( data: typing.Union[str, datasets.arrow_dataset.Dataset], columns_names: typing.Dict[str, str] )
Ensure the columns required for the evaluation are present in the dataset.
compute_metric
< source >( metric: EvaluationModule, metric_inputs: typing.Dict, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, random_state: typing.Optional[int] = None )
Compute and return metrics.
get_dataset_split
< source >( data, subset = None, split = None ) → split
Infers which split to use if None is given.
load_data
< source >( data: typing.Union[str, datasets.arrow_dataset.Dataset], subset: str = None, split: str = None ) → data (Dataset)
Parameters
- data (Dataset or str, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Specifies the dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).
- split (str, defaults to None): User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is a str type, will automatically select the best one via choose_split().
Returns
data (Dataset)
Loaded dataset which will be used for evaluation.
Load dataset with given subset and split.
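As a sketch, load_data can be called directly on a loaded evaluator; the dataset name and slice below are illustrative:
>>> from evaluate import evaluator
>>> task_evaluator = evaluator("text-classification")
>>> data = task_evaluator.load_data("imdb", split="test[:100]")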
A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
prepare_data
< source >( data: Dataset, input_column: str, label_column: str, *args, **kwargs ) → dict
Parameters
- data (Dataset): Specifies the dataset we will run evaluation on.
- input_column (str, defaults to "text"): The name of the column containing the text feature in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
Returns
dict: metric inputs.
list: pipeline inputs.
Prepare data.
prepare_metric
< source >( metric: typing.Union[str, evaluate.module.EvaluationModule] )
Prepare metric.
prepare_pipeline
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')], tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, device: int = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task. If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None): Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
Prepare pipeline.
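A minimal sketch of preparing a pipeline manually, for example to reuse it across several compute() calls; the model, dataset, and label mapping below are illustrative:
>>> from evaluate import evaluator
>>> task_evaluator = evaluator("text-classification")
>>> pipe = task_evaluator.prepare_pipeline(
...     model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english"
... )
>>> # the prepared pipeline can then be passed back via model_or_pipeline
>>> results = task_evaluator.compute(
...     model_or_pipeline=pipe,
...     data="imdb",
...     split="test[:100]",
...     metric="accuracy",
...     label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
... )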
The task-specific evaluators
ImageClassificationEvaluator
class evaluate.ImageClassificationEvaluator
< source >( task = 'image-classification', default_metric_name = None )
Image classification evaluator.
This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification.
Methods in this class assume a data format compatible with the ImageClassificationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'image', label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )
QuestionAnsweringEvaluator
class evaluate.QuestionAnsweringEvaluator
< source >( task = 'question-answering', default_metric_name = None )
Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.
Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, question_column: str = 'question', context_column: str = 'context', id_column: str = 'id', label_column: str = 'answers', squad_v2_format: typing.Optional[bool] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )
Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
TextClassificationEvaluator
class evaluate.TextClassificationEvaluator
< source >( task = 'text-classification', default_metric_name = None )
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', second_input_column: typing.Optional[str] = None, label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias sentiment-analysis). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
TokenClassificationEvaluator
class evaluate.TokenClassificationEvaluator
< source >( task = 'token-classification', default_metric_name = None )
Token classification evaluator.
This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.
Methods in this class assume a data format compatible with the TokenClassificationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: str = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: typing.Optional[int] = None, random_state: typing.Optional[int] = None, input_column: str = 'tokens', label_column: str = 'ner_tags', join_by: typing.Optional[str] = ' ' )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
For example, the following dataset format is accepted by the evaluator:
dataset = Dataset.from_dict(
mapping={
"tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
"ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
},
features=Features({
"tokens": Sequence(feature=Value(dtype="string")),
"ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
}),
)
For example, the following dataset format is not accepted by the evaluator:
dataset = Dataset.from_dict(
mapping={
"tokens": [["New York is a city and Felix a person."]],
"starts": [[0, 23]],
"ends": [[7, 27]],
"ner_tags": [["LOC", "PER"]],
},
features=Features({
"tokens": Value(dtype="string"),
"starts": Sequence(feature=Value(dtype="int32")),
"ends": Sequence(feature=Value(dtype="int32")),
"ner_tags": Sequence(feature=Value(dtype="string")),
}),
)
TextGenerationEvaluator
class evaluate.TextGenerationEvaluator
< source >( task = 'text-generation', default_metric_name = None, predictions_prefix: str = 'generated' )
Text generation evaluator.
This text generation evaluator can currently be loaded from evaluator() using the default task name text-generation.
Methods in this class assume a data format compatible with the TextGenerationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
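Example (a sketch only; the gpt2 checkpoint, the wikitext slice, the filtering of empty rows, and the word_count measurement are illustrative assumptions, not documented defaults):
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-generation")
>>> data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test[:20]")
>>> data = data.filter(lambda x: x["text"].strip() != "")
>>> results = task_evaluator.compute(
...     model_or_pipeline="gpt2",
...     data=data,
...     input_column="text",
...     metric="word_count",
... )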
Text2TextGenerationEvaluator
class evaluate.Text2TextGenerationEvaluator
< source >( task = 'text2text-generation', default_metric_name = None )
Text2Text generation evaluator.
This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name text2text-generation.
Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
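Example (a sketch only; the seq2seq checkpoint, dataset, column names, metric, and generation_kwargs below are illustrative choices, not library defaults):
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text2text-generation")
>>> data = load_dataset("xsum", split="validation[:4]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/distilbart-xsum-12-3",
...     data=data,
...     input_column="document",
...     label_column="summary",
...     metric="rouge",
...     generation_kwargs={"truncation": True},
... )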
SummarizationEvaluator
class evaluate.SummarizationEvaluator
< source >( task = 'summarization', default_metric_name = None )
Text summarization evaluator.
This text summarization evaluator can currently be loaded from evaluator() using the default task name summarization.
Methods in this class assume a data format compatible with the SummarizationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
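Example (a sketch; the model, dataset, metric, and column names below are illustrative choices):
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("summarization")
>>> data = load_dataset("cnn_dailymail", "3.0.0", split="validation[:4]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="facebook/bart-large-cnn",
...     data=data,
...     input_column="article",
...     label_column="highlights",
...     metric="rouge",
... )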
TranslationEvaluator
class evaluate.TranslationEvaluator
< source >( task = 'translation', default_metric_name = None )
Translation evaluator.
This translation generation evaluator can currently be loaded from evaluator() using the default task name translation.
Methods in this class assume a data format compatible with the TranslationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
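Example (a sketch; the language pair, model, metric, and the flattening step below are illustrative assumptions; the nested translation field is first mapped to flat text/label columns so the evaluator's default column names apply):
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("translation")
>>> data = load_dataset("wmt16", "ro-en", split="validation[:8]")
>>> data = data.map(lambda x: {"text": x["translation"]["en"], "label": x["translation"]["ro"]})
>>> results = task_evaluator.compute(
...     model_or_pipeline="Helsinki-NLP/opus-mt-en-ro",
...     data=data,
...     metric="sacrebleu",
... )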
AutomaticSpeechRecognitionEvaluator
class evaluate.AutomaticSpeechRecognitionEvaluator
< source >( task = 'automatic-speech-recognition', default_metric_name = None )
Automatic speech recognition evaluator.
This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name automatic-speech-recognition.
Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'path', label_column: str = 'sentence', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, infers based on the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred and CUDA:0 used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="openai/whisper-tiny.en",
...     data=data,
...     input_column="path",
...     label_column="sentence",
...     metric="wer",
... )