---
license: apache-2.0
tags:
datasets:
- wikipedia
- c4
---

# Perceiver IO for language

Perceiver IO model pre-trained on the Masked Language Modeling (MLM) task proposed in [BERT](https://arxiv.org/abs/1810.04805), using a large text corpus obtained by combining [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [C4](https://huggingface.co/datasets/c4). It was introduced in the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Jaegle et al. and first released in [this repository](https://github.com/deepmind/deepmind-research/tree/master/perceiver).

Disclaimer: The team releasing Perceiver IO did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

Perceiver IO is a transformer encoder model that can be applied to any modality (text, images, audio, video, ...). The core idea is to apply the self-attention mechanism to a relatively small set of latent vectors and to use the inputs only for cross-attention with those latents. As a result, the time and memory requirements of the self-attention mechanism do not depend on the size of the inputs.

To decode, the authors employ so-called decoder queries, which make it possible to flexibly decode the final hidden states of the latents into outputs of arbitrary size and semantics. For masked language modeling, the output is a tensor containing the prediction scores of the language modeling head, of shape (batch_size, seq_length, vocab_size).

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg" alt="drawing" width="600"/>

<small>Perceiver IO architecture.</small>
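
To make the latent bottleneck concrete, the following minimal sketch only reads the checkpoint's configuration (attribute names as exposed by the `PerceiverConfig` class in the `transformers` library):

```python
from transformers import PerceiverConfig

# download only the configuration of this checkpoint (no model weights)
config = PerceiverConfig.from_pretrained("deepmind/language-perceiver")

# self-attention operates on a latent array of fixed size (num_latents x d_latents),
# regardless of how long the input is
print("num_latents:", config.num_latents)
print("d_latents:", config.d_latents)

# inputs only enter through cross-attention; for MLM, the decoder maps the latents
# back to logits of shape (batch_size, seq_length, vocab_size)
print("vocab_size:", config.vocab_size)
print("max input length:", config.max_position_embeddings)
```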

As the time and memory requirements of the self-attention mechanism don't depend on the size of the inputs, the Perceiver IO authors train the model directly on raw UTF-8 bytes, rather than on subwords as is done in models like BERT, RoBERTa and GPT-2. This has many benefits: one doesn't need to train a tokenizer before training the model, one doesn't need to maintain a (fixed) vocabulary file, and this also doesn't hurt model performance, as shown by [Bostrom et al., 2020](https://arxiv.org/abs/2004.03720).
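
As an illustration of this byte-level interface, here is a minimal sketch using the tokenizer of this checkpoint (the vocabulary size is read from the tokenizer rather than assumed):

```python
from transformers import PerceiverTokenizer

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")

# there is no subword vocabulary: the text is turned into raw UTF-8 bytes,
# offset by a few special tokens such as [CLS], [SEP], [PAD] and [MASK]
ids = tokenizer("déjà vu").input_ids
print(ids)  # one ID per byte; the accented characters take two bytes each
print(tokenizer.decode(ids, skip_special_tokens=True))  # déjà vu
print(tokenizer.vocab_size)  # 256 byte values plus the special tokens
```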

Through pre-training, the model learns an inner representation of language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the Perceiver model as inputs.
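
As a hypothetical starting point (not part of the original release), the `transformers` library provides `PerceiverForSequenceClassification`, which puts a randomly initialized classification decoder on top of this checkpoint; the number of labels below is a placeholder:

```python
from transformers import PerceiverTokenizer, PerceiverForSequenceClassification

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
# the classification decoder is newly initialized, so the model must be fine-tuned
# before it is useful; num_labels=2 is a placeholder for a binary task
model = PerceiverForSequenceClassification.from_pretrained(
    "deepmind/language-perceiver", num_labels=2
)

encoding = tokenizer("A sentence to classify.", padding="max_length", return_tensors="pt")
outputs = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask)
print(outputs.logits.shape)  # (batch_size, num_labels)
```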

## Intended uses & limitations

You can use the raw model for masked language modeling, but the model is intended to be fine-tuned on a labeled dataset. See the [model hub](https://huggingface.co/models?search=deepmind/perceiver) to look for fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model in PyTorch:

```python
import torch
from transformers import PerceiverTokenizer, PerceiverForMaskedLM

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

text = "This is an incomplete sentence where some words are missing."
# prepare input: the tokenizer encodes the text as UTF-8 bytes and pads it to the maximum length (2048)
encoding = tokenizer(text, padding="max_length", return_tensors="pt")
# mask " missing." (positions 52-60, i.e. the bytes of the span offset by the leading [CLS] token).
# Note that the model performs much better if the masked span starts with a space.
encoding.input_ids[0, 52:61] = tokenizer.mask_token_id
inputs, input_mask = encoding.input_ids.to(device), encoding.attention_mask.to(device)

# forward pass
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# take the most likely byte at each masked position and decode back to text
masked_tokens_predictions = logits[0, 52:61].argmax(dim=-1)
print(tokenizer.decode(masked_tokens_predictions))  # should print " missing."
```

## Training data

This model was pretrained on a combination of [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [C4](https://huggingface.co/datasets/c4). 70% of the training tokens were sampled from the C4 dataset and the remaining 30% from Wikipedia. The authors concatenate 10 documents before splitting into crops to reduce wasteful computation on padding tokens.
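
This is not the authors' actual data pipeline, but a rough sketch of how such a 70/30 mixture could be approximated with the `datasets` library; the dataset and configuration names are assumptions, and the concatenation-into-crops step is omitted:

```python
from datasets import load_dataset, interleave_datasets

# stream both corpora; the config names below may need adjusting depending on
# the version of the `datasets` library
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
c4 = load_dataset("c4", "en", split="train", streaming=True)

# keep only the text column so the two schemas line up
wiki = wiki.select_columns(["text"])
c4 = c4.select_columns(["text"])

# draw roughly 70% of examples from C4 and 30% from Wikipedia
mixture = interleave_datasets([c4, wiki], probabilities=[0.7, 0.3], seed=42)
print(next(iter(mixture))["text"][:200])
```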

## Training procedure

### Preprocessing

Text preprocessing is trivial: it only involves encoding text into UTF-8 bytes, and padding them up to the same length (2048).
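
In terms of the `transformers` API, this amounts to a single tokenizer call (a minimal sketch, assuming the tokenizer's maximum length is set to 2048 for this checkpoint):

```python
from transformers import PerceiverTokenizer

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")

# encode to UTF-8 bytes and pad up to the fixed input length of 2048
encoding = tokenizer("Some training text.", padding="max_length", return_tensors="pt")
print(encoding.input_ids.shape)       # torch.Size([1, 2048])
print(encoding.attention_mask.shape)  # torch.Size([1, 2048])
```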

### Pretraining

Hyperparameter details can be found in Table 9 of the [paper](https://arxiv.org/abs/2107.14795).

## Evaluation results

This model achieves an average score of 81.8 on GLUE. For more details, we refer to Table 3 of the original paper.

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2107-14795,
  author    = {Andrew Jaegle and
               Sebastian Borgeaud and
               Jean{-}Baptiste Alayrac and
               Carl Doersch and
               Catalin Ionescu and
               David Ding and
               Skanda Koppula and
               Daniel Zoran and
               Andrew Brock and
               Evan Shelhamer and
               Olivier J. H{\'{e}}naff and
               Matthew M. Botvinick and
               Andrew Zisserman and
               Oriol Vinyals and
               Jo{\~{a}}o Carreira},
  title     = {Perceiver {IO:} {A} General Architecture for Structured Inputs {\&} Outputs},
  journal   = {CoRR},
  volume    = {abs/2107.14795},
  year      = {2021},
  url       = {https://arxiv.org/abs/2107.14795},
  eprinttype = {arXiv},
  eprint    = {2107.14795},
  timestamp = {Tue, 03 Aug 2021 14:53:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2107-14795.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```