Spaces:
Runtime error
Runtime error
<!--Copyright 2020 The HuggingFace Team. All rights reserved. | |
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | |
the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | |
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations under the License. | |
--> | |
# Processors | |
Processors can mean two different things in the Transformers library: | |
- the objects that pre-process inputs for multi-modal models such as [Wav2Vec2](../model_doc/wav2vec2) (speech and text) | |
or [CLIP](../model_doc/clip) (text and vision) | |
- deprecated objects that were used in older versions of the library to preprocess data for GLUE or SQUAD. | |
## Multi-modal processors | |
Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text, | |
vision and audio). This is handled by objects called processors, which group together two or more processing objects | |
such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio). | |
Those processors inherit from the following base class that implements the saving and loading functionality: | |
[[autodoc]] ProcessorMixin | |
## Deprecated processors | |
All processors follow the same architecture which is that of the | |
[`~data.processors.utils.DataProcessor`]. The processor returns a list of | |
[`~data.processors.utils.InputExample`]. These | |
[`~data.processors.utils.InputExample`] can be converted to | |
[`~data.processors.utils.InputFeatures`] in order to be fed to the model. | |
[[autodoc]] data.processors.utils.DataProcessor | |
[[autodoc]] data.processors.utils.InputExample | |
[[autodoc]] data.processors.utils.InputFeatures | |
## GLUE | |
[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) is a benchmark that evaluates the | |
performance of models across a diverse set of existing NLU tasks. It was released together with the paper [GLUE: A | |
multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7) | |
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, | |
QQP, QNLI, RTE and WNLI. | |
Those processors are: | |
- [`~data.processors.utils.MrpcProcessor`] | |
- [`~data.processors.utils.MnliProcessor`] | |
- [`~data.processors.utils.MnliMismatchedProcessor`] | |
- [`~data.processors.utils.Sst2Processor`] | |
- [`~data.processors.utils.StsbProcessor`] | |
- [`~data.processors.utils.QqpProcessor`] | |
- [`~data.processors.utils.QnliProcessor`] | |
- [`~data.processors.utils.RteProcessor`] | |
- [`~data.processors.utils.WnliProcessor`] | |
Additionally, the following method can be used to load values from a data file and convert them to a list of | |
[`~data.processors.utils.InputExample`]. | |
[[autodoc]] data.processors.glue.glue_convert_examples_to_features | |
## XNLI | |
[The Cross-Lingual NLI Corpus (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) is a benchmark that evaluates the | |
quality of cross-lingual text representations. XNLI is crowd-sourced dataset based on [*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/): pairs of text are labeled with textual entailment annotations for 15 | |
different languages (including both high-resource language such as English and low-resource languages such as Swahili). | |
It was released together with the paper [XNLI: Evaluating Cross-lingual Sentence Representations](https://arxiv.org/abs/1809.05053) | |
This library hosts the processor to load the XNLI data: | |
- [`~data.processors.utils.XnliProcessor`] | |
Please note that since the gold labels are available on the test set, evaluation is performed on the test set. | |
An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/text-classification/run_xnli.py) script. | |
## SQuAD | |
[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) is a benchmark that | |
evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version | |
(v1.1) was released together with the paper [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250). The second version (v2.0) was released alongside the paper [Know What You Don't | |
Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822). | |
This library hosts a processor for each of the two versions: | |
### Processors | |
Those processors are: | |
- [`~data.processors.utils.SquadV1Processor`] | |
- [`~data.processors.utils.SquadV2Processor`] | |
They both inherit from the abstract class [`~data.processors.utils.SquadProcessor`] | |
[[autodoc]] data.processors.squad.SquadProcessor | |
- all | |
Additionally, the following method can be used to convert SQuAD examples into | |
[`~data.processors.utils.SquadFeatures`] that can be used as model inputs. | |
[[autodoc]] data.processors.squad.squad_convert_examples_to_features | |
These processors as well as the aforementioned method can be used with files containing the data as well as with the | |
*tensorflow_datasets* package. Examples are given below. | |
### Example usage | |
Here is an example using the processors as well as the conversion method using data files: | |
```python | |
# Loading a V2 processor | |
processor = SquadV2Processor() | |
examples = processor.get_dev_examples(squad_v2_data_dir) | |
# Loading a V1 processor | |
processor = SquadV1Processor() | |
examples = processor.get_dev_examples(squad_v1_data_dir) | |
features = squad_convert_examples_to_features( | |
examples=examples, | |
tokenizer=tokenizer, | |
max_seq_length=max_seq_length, | |
doc_stride=args.doc_stride, | |
max_query_length=max_query_length, | |
is_training=not evaluate, | |
) | |
``` | |
Using *tensorflow_datasets* is as easy as using a data file: | |
```python | |
# tensorflow_datasets only handle Squad V1. | |
tfds_examples = tfds.load("squad") | |
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate) | |
features = squad_convert_examples_to_features( | |
examples=examples, | |
tokenizer=tokenizer, | |
max_seq_length=max_seq_length, | |
doc_stride=args.doc_stride, | |
max_query_length=max_query_length, | |
is_training=not evaluate, | |
) | |
``` | |
Another example using these processors is given in the [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) script. | |