yhavinga committed on
Commit
3760e28
1 Parent(s): c827f5b
README.md ADDED
@@ -0,0 +1,192 @@
1
+
2
+ ---
3
+ language:
4
+ - nl
5
+ - en
6
+ - multilingual
7
+ license: apache-2.0
8
+ tags:
9
+ - dutch
10
+ - english
11
+ - t5
12
+ - t5x
13
+ - ul2
14
+ - seq2seq
15
+ datasets:
16
+ - yhavinga/mc4_nl_cleaned
17
+ - yhavinga/nedd_wiki_news
18
+ inference: false
19
+ ---
20
+
21
+ # ul2-large-dutch-english for Dutch and English
22
+
23
+ Pretrained T5 model on Dutch and English using a UL2 (Mixture-of-Denoisers) objective.
24
+ The T5 model was introduced in
25
+ [this paper](https://arxiv.org/abs/1910.10683)
26
+ and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
27
+ The UL2 objective was introduced in
28
+ [this paper](https://arxiv.org/abs/2205.05131)
29
+ and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).
30
+
31
+ **Note:** The Hugging Face inference widget is deactivated because this model needs text-to-text fine-tuning on
32
+ a specific downstream task to be useful in practice.
33
+
34
+ ## Model description
35
+
36
+ T5 is an encoder-decoder model and treats all NLP problems in a text-to-text format.
37
+ `ul2-large-dutch-english` T5 is a transformers model pretrained on a very large corpus of
38
+ Dutch and English data in a self-supervised fashion.
39
+ This means it was pretrained on the raw texts only, with no humans labelling them in any way
40
+ (which is why it can use lots of publicly available data) with an automatic process to generate
41
+ inputs and outputs from those texts.
42
+
43
+
44
+ This model uses the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model during pretraining:
45
+ - GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
46
+ - Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning
47
+ - Pre-trained on a self-supervised objective only, without mixing in downstream tasks
48
+ - No parameter sharing between embedding and classifier layer
49
+
50
+
51
+
52
+ ### UL2 pretraining objective
53
+
54
+ This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
55
+ paradigms together. UL2 frames different objective functions for training language models as denoising tasks, where
56
+ the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
57
+ that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
58
+ three denoising tasks:
59
+
60
+ 1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
61
+ 2. X-denoising (or extreme span corruption); and
62
+ 3. S-denoising (or sequential PrefixLM).
63
+
64
+ During pre-training, we sample from the available denoising tasks based on user-specified ratios.
65
+ UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
66
+ denoising task. During pre-training, a paradigm token is inserted into the input
67
+ (`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
68
+ Then, during fine-tuning the same input token should be inserted to get the best performance for different downstream
69
+ fine-tuning tasks.
70
+
71
+ ## Intended uses & limitations
72
+
73
+ This model was only pretrained in a self-supervised way excluding any supervised training.
74
+ Therefore, this model has to be fine-tuned before it is usable on a downstream task,
75
+ like text classification, unlike Google's original T5 model.
76
+
77
+ **Note:** You most likely need to fine-tune these T5/UL2 models without mixed precision
78
+ so fine-tune them with full fp32 precision. Fine-tuning with Flax in bf16 - `model.to_bf16()` - is possible
79
+ if you set the mask correctly to exclude layernorm and embedding layers. Also note that the T5x pre-training
80
+ and fine-tuning configs set `z_loss` to 1e-4, which is used to keep the loss scale from underflowing.
81
+ You can also find more fine-tuning tips [here](https://discuss.huggingface.co/t/t5-finetuning-tips), for example.
82
+
83
+ **Note**: For fine-tuning, most likely you can get better results if you insert a prefix token
84
+ of `[NLU]`, `[NLG]`, or `[S2S]` to your input texts.
85
+ For general language understanding fine-tuning tasks, you could use the `[NLU]` token.
86
+ For GPT-style causal language generation, you could use the `[S2S]` token.
87
+ The `[NLG]` token of the X-denoising pretraining task is something of a mix between language understanding and causal language
88
+ generation, so the `[NLG]` token might also be suitable for language generation fine-tuning.
89
+
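As a minimal sketch of the prefixing advice above, the helper below prepends a paradigm token to raw input text before tokenization. The function name, the task labels used as dictionary keys, and the single space between token and text are assumptions made for illustration; they are not part of this repository.

```python
# Hypothetical helper: the mapping restates the guidance in this README.
PARADIGM_TOKENS = {
    "understanding": "[NLU]",  # token used with R-denoising during pre-training
    "generation": "[NLG]",     # token used with X-denoising during pre-training
    "causal": "[S2S]",         # token used with S-denoising during pre-training
}


def add_paradigm_prefix(text: str, task_kind: str) -> str:
    """Prepend the UL2 paradigm token for the given kind of fine-tuning task."""
    token = PARADIGM_TOKENS[task_kind]
    # Assumption: a single space separates the paradigm token from the text.
    return f"{token} {text}"


examples = ["Dit is een voorbeeldzin.", "This is an example sentence."]
prefixed = [add_paradigm_prefix(t, "understanding") for t in examples]
print(prefixed)
```

The prefixed strings can then be passed to the tokenizer in the usual way. Whichever token is chosen should be used consistently for both fine-tuning and inference.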
90
+ ### How to use
91
+
92
+ Here is how to use this model in PyTorch:
93
+
94
+ ```python
95
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
96
+
97
+ tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-large-dutch-english", use_fast=False)
98
+ model = T5ForConditionalGeneration.from_pretrained("yhavinga/ul2-large-dutch-english")
99
+ ```
100
+
101
+ and in Flax:
102
+
103
+ ```python
104
+ from transformers import T5Tokenizer, FlaxT5ForConditionalGeneration
105
+
106
+ tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-large-dutch-english", use_fast=False)
107
+ model = FlaxT5ForConditionalGeneration.from_pretrained("yhavinga/ul2-large-dutch-english")
108
+ ```
109
+
110
+
111
+ ### Limitations and bias
112
+
113
+ The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
114
+ Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
115
+
116
+ ## Training data
117
+
118
+ The `ul2-large-dutch-english` T5 model was pre-trained simultaneously on a combination of several datasets,
119
+ including the `full_en_nl` config of the "mc4_nl_cleaned" dataset, which is a cleaned version of Common Crawl's web
120
+ crawl corpus, Dutch books, the Dutch subset of Wikipedia (2022-03-20), the English subset of Wikipedia (2022-03-01),
121
+ and a subset of "mc4_nl_cleaned"
122
+ containing only texts from Dutch and Belgian newspapers. This last dataset is oversampled to bias the model
123
+ towards descriptions of events in the Netherlands and Belgium.
124
+
125
+
126
+
127
+ ## Training procedure
128
+
129
+ ### Preprocessing
130
+
131
+ The ul2-large-dutch-english T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
132
+ The tokenizer includes the special tokens `<pad>`, `</s>`, `<unk>`, known from the original T5 paper,
133
+ `[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newline.
134
+ During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
135
+ The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
136
+ between `dutch` and `Dutch`.
137
+ Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.
138
+
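The vocabulary arithmetic above can be checked with a short sketch. The constant names are chosen here for illustration; the numbers come from this README, from `extra_ids` in `tokenizer_config.json`, and from the id range listed in `added_tokens.json`.

```python
BASE_VOCAB = 32_000     # SentencePiece unigram vocabulary
SENTINEL_TOKENS = 100   # <extra_id_0> ... <extra_id_99>
NEW_ID_TOKENS = 28      # [new_id_0] ... [new_id_27]

total_vocab = BASE_VOCAB + SENTINEL_TOKENS + NEW_ID_TOKENS

# added_tokens.json maps [new_id_i] to id 32100 + i.
new_id_map = {
    f"[new_id_{i}]": BASE_VOCAB + SENTINEL_TOKENS + i
    for i in range(NEW_ID_TOKENS)
}

print(total_vocab)                # matches vocab_size in config.json
print(new_id_map["[new_id_0]"])   # first id after the 100 sentinel tokens
print(total_vocab % 128)          # the total is a multiple of 128
```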
139
+ ### Pretraining
140
+ The model was trained on a TPU v3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/),
141
+ for 1,000,000 steps with a batch size of 64
142
+ (in total 32 B tokens).
143
+ The optimizer used was Adafactor, with the learning rate held constant at 1e-2 during a warmup of 10K steps,
144
+ and then an inverse square root decay of the learning rate after that.
145
+ The model was trained with Google's Jax/Flax based [t5x framework](https://github.com/google-research/t5x) with help
146
+ from [Stephenn Fernandes](https://huggingface.co/StephennFernandes) to get started writing task definitions that wrap
147
+ HF datasets.
148
+
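The token budget and learning rate schedule described above can be sketched in a few lines. This is an illustration, not the t5x code: it assumes that the `constant * rsqrt_decay` schedule in `config.gin` evaluates to `base_learning_rate / sqrt(max(step, warmup_steps))`, and it counts only the 512 input positions per example when estimating tokens seen.

```python
import math

TRAIN_STEPS = 1_000_000
BATCH_SIZE = 64
SEQ_LEN = 512           # 'inputs' length in TASK_FEATURE_LENGTHS
WARMUP_STEPS = 10_000
BASE_LR = 1.0           # base_learning_rate in config.gin


def learning_rate(step: int) -> float:
    """Assumed rsqrt schedule: flat during warmup, then 1/sqrt(step) decay."""
    return BASE_LR / math.sqrt(max(step, WARMUP_STEPS))


# Rough number of input tokens processed over the whole run.
tokens_seen = TRAIN_STEPS * BATCH_SIZE * SEQ_LEN

print(f"{tokens_seen / 1e9:.1f} B tokens")
for step in (1, WARMUP_STEPS, 100_000, TRAIN_STEPS):
    print(step, learning_rate(step))
```

Under this assumption the rate stays at 1e-2 for the first 10K steps and has decayed to 1e-3 by the final step.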
149
+ The UL2 training objective code used with the [t5x framework](https://github.com/google-research/t5x) was copied and
150
+ slightly modified from the [UL2 paper](https://arxiv.org/pdf/2205.05131.pdf) appendix chapter 9.2 by the authors
151
+ of the Finnish ul2 models. The UL2 objective code used is available in the repository
152
+ [Finnish-NLP/ul2-base-nl36-finnish](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) in the files `ul2_objective.py` and `tasks.py`.
153
+ UL2's mixture-of-denoisers configuration was otherwise equal to the UL2 paper
154
+ except for the denoiser mixing rates: 20% was used for S-denoising (as suggested in chapter 4.5 of the paper)
155
+ and the rest was divided equally between the R-denoising and X-denoising (i.e. 40% for both).
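To make the 40/40/20 mixing concrete, the sketch below samples a denoiser per example according to those rates. It only illustrates the ratios; the actual training used the task definitions in `ul2_objective.py` and `tasks.py`, and the function and variable names here are hypothetical.

```python
import random

# Mixing rates described above: R and X share the remainder after 20% for S.
DENOISER_RATES = {"R": 0.4, "X": 0.4, "S": 0.2}
PARADIGM_TOKEN = {"R": "[NLU]", "X": "[NLG]", "S": "[S2S]"}


def sample_denoiser(rng: random.Random) -> str:
    """Pick one denoising task according to the mixing rates."""
    names = list(DENOISER_RATES)
    weights = [DENOISER_RATES[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
counts = {name: 0 for name in DENOISER_RATES}
for _ in range(10_000):
    counts[sample_denoiser(rng)] += 1

for name, count in counts.items():
    print(name, PARADIGM_TOKEN[name], count / 10_000)
```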
156
+ ### Model list
157
+
158
+ Models in this series:
159
+
160
+ | | ul2-base-dutch-english | ul2-large-dutch-english | ul2-small-dutch-english |
161
+ |:---------------------|:-------------------------|:--------------------------|:--------------------------|
162
+ | model_type | t5 | t5 | t5 |
163
+ | _pipeline_tag | text2text-generation | text2text-generation | text2text-generation |
164
+ | d_model | 768 | 1024 | 512 |
165
+ | d_ff | 2048 | 2816 | 1024 |
166
+ | num_heads | 12 | 16 | 6 |
167
+ | d_kv | 64 | 64 | 64 |
168
+ | num_layers | 12 | 24 | 8 |
169
+ | num_decoder_layers | 12 | 24 | 8 |
170
+ | feed_forward_proj | gated-gelu | gated-gelu | gated-gelu |
171
+ | dense_act_fn | gelu_new | gelu_new | gelu_new |
172
+ | vocab_size | 32128 | 32128 | 32128 |
173
+ | tie_word_embeddings | 0 | 0 | 0 |
174
+ | torch_dtype | float32 | float32 | float32 |
175
+ | _gin_batch_size | 128 | 64 | 128 |
176
+ | _gin_z_loss | 0.0001 | 0.0001 | 0.0001 |
177
+ | _gin_t5_config_dtype | 'bfloat16' | 'bfloat16' | 'bfloat16' |
178
+
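The table values can be turned into a rough parameter count. The sketch below is a back-of-the-envelope estimate, assuming the usual T5 v1.1 layout: bias-free attention projections, a gated feed-forward block with three weight matrices, scale-only layer norms, one relative attention bias table per stack, and untied input and output embeddings. The function name is hypothetical.

```python
def t5_param_count(d_model, d_ff, num_heads, d_kv, num_layers,
                   num_decoder_layers, vocab_size, num_buckets=32):
    """Estimate parameters of a T5 v1.1 style model with untied embeddings."""
    inner = num_heads * d_kv
    attn = 4 * d_model * inner     # q, k, v and output projections, no biases
    ffn = 3 * d_model * d_ff       # gated feed-forward: two input mats, one output
    norm = d_model                 # layer norm holds a scale vector only
    enc_layer = attn + ffn + 2 * norm
    dec_layer = 2 * attn + ffn + 3 * norm    # self-attention plus cross-attention
    rel_bias = 2 * num_buckets * num_heads   # one bias table per stack
    final_norms = 2 * norm
    embeddings = 2 * vocab_size * d_model    # shared input embedding + lm head
    return (num_layers * enc_layer + num_decoder_layers * dec_layer
            + rel_bias + final_norms + embeddings)


large = t5_param_count(d_model=1024, d_ff=2816, num_heads=16, d_kv=64,
                       num_layers=24, num_decoder_layers=24, vocab_size=32128)
print(f"{large / 1e6:.0f}M parameters, about {large * 4 / 1e9:.2f} GB in float32")
```

For the large model this estimate, at 4 bytes per float32 parameter, comes out within a few tens of kilobytes of the `flax_model.msgpack` size listed in this commit.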
179
+
180
+ ## Evaluation results
181
+
182
+ See the evaluation section in the interactive [Pre-training Dutch T5 Models](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models) blog.
183
+
184
+ ## Acknowledgements
185
+
186
+ This project would not have been possible without compute generously provided by Google through the
187
+ [TPU Research Cloud](https://sites.research.google/trc/).
188
+ Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
189
+ Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.
190
+
191
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
192
+
added_tokens.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"[new_id_17]": 32117, "[new_id_20]": 32120, "[new_id_13]": 32113, "[new_id_2]": 32102, "[new_id_16]": 32116, "[new_id_7]": 32107, "[new_id_5]": 32105, "[new_id_1]": 32101, "[new_id_15]": 32115, "[new_id_12]": 32112, "[new_id_0]": 32100, "[new_id_11]": 32111, "[new_id_25]": 32125, "[new_id_24]": 32124, "[new_id_10]": 32110, "[new_id_27]": 32127, "[new_id_23]": 32123, "[new_id_14]": 32114, "[new_id_22]": 32122, "[new_id_21]": 32121, "[new_id_19]": 32119, "[new_id_3]": 32103, "[new_id_4]": 32104, "[new_id_18]": 32118, "[new_id_9]": 32109, "[new_id_8]": 32108, "[new_id_26]": 32126, "[new_id_6]": 32106}
config.gin ADDED
@@ -0,0 +1,150 @@
1
+ from __gin__ import dynamic_registration
2
+ import __main__ as train_script
3
+ import seqio
4
+ import t5.data.mixtures
5
+ from t5x import adafactor
6
+ from t5x.examples.t5 import network
7
+ from t5x import gin_utils
8
+ from t5x import models
9
+ from t5x import partitioning
10
+ from t5x import trainer
11
+ from t5x import utils
12
+ import tasks.nedd_tasks
13
+ import tasks.ul2_tasks as tasks2
14
+
15
+ # Macros:
16
+ # ==============================================================================
17
+ BATCH_SIZE = 64
18
+ DROPOUT_RATE = 0.0
19
+ LABEL_SMOOTHING = 0.0
20
+ LOSS_NORMALIZING_FACTOR = None
21
+ MIXTURE_OR_TASK_MODULE = None
22
+ MIXTURE_OR_TASK_NAME = 'ul2_en_nl_mc4_nedd_wiki_news_mix_1'
23
+ MODEL = @models.EncoderDecoderModel()
24
+ MODEL_DIR = 'ul2_large_en_nl_mc4_nedd_wiki_news_nl'
25
+ OPTIMIZER = @adafactor.Adafactor()
26
+ RANDOM_SEED = None
27
+ SHUFFLE_TRAIN_EXAMPLES = True
28
+ TASK_FEATURE_LENGTHS = {'inputs': 512, 'targets': 512}
29
+ TRAIN_STEPS = 1000000
30
+ USE_CACHED_TASKS = False
31
+ USE_HARDWARE_RNG = False
32
+ VOCABULARY = @seqio.SentencePieceVocabulary()
33
+ Z_LOSS = 0.0001
34
+
35
+ # Parameters for adafactor.Adafactor:
36
+ # ==============================================================================
37
+ adafactor.Adafactor.decay_rate = 0.8
38
+ adafactor.Adafactor.logical_factor_rules = \
39
+ @adafactor.standard_logical_factor_rules()
40
+ adafactor.Adafactor.step_offset = 0
41
+
42
+ # Parameters for utils.CheckpointConfig:
43
+ # ==============================================================================
44
+ utils.CheckpointConfig.restore = @utils.RestoreCheckpointConfig()
45
+ utils.CheckpointConfig.save = @utils.SaveCheckpointConfig()
46
+
47
+ # Parameters for utils.create_learning_rate_scheduler:
48
+ # ==============================================================================
49
+ utils.create_learning_rate_scheduler.base_learning_rate = 1.0
50
+ utils.create_learning_rate_scheduler.factors = 'constant * rsqrt_decay'
51
+ utils.create_learning_rate_scheduler.warmup_steps = 10000
52
+
53
+ # Parameters for train/utils.DatasetConfig:
54
+ # ==============================================================================
55
+ train/utils.DatasetConfig.batch_size = %BATCH_SIZE
56
+ train/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
57
+ train/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
58
+ train/utils.DatasetConfig.pack = True
59
+ train/utils.DatasetConfig.seed = None
60
+ train/utils.DatasetConfig.shuffle = %SHUFFLE_TRAIN_EXAMPLES
61
+ train/utils.DatasetConfig.split = 'train'
62
+ train/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
63
+ train/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
64
+
65
+ # Parameters for train_eval/utils.DatasetConfig:
66
+ # ==============================================================================
67
+ train_eval/utils.DatasetConfig.batch_size = %BATCH_SIZE
68
+ train_eval/utils.DatasetConfig.mixture_or_task_name = %MIXTURE_OR_TASK_NAME
69
+ train_eval/utils.DatasetConfig.module = %MIXTURE_OR_TASK_MODULE
70
+ train_eval/utils.DatasetConfig.pack = True
71
+ train_eval/utils.DatasetConfig.seed = 42
72
+ train_eval/utils.DatasetConfig.shuffle = False
73
+ train_eval/utils.DatasetConfig.split = 'validation'
74
+ train_eval/utils.DatasetConfig.task_feature_lengths = %TASK_FEATURE_LENGTHS
75
+ train_eval/utils.DatasetConfig.use_cached = %USE_CACHED_TASKS
76
+
77
+ # Parameters for models.EncoderDecoderModel:
78
+ # ==============================================================================
79
+ models.EncoderDecoderModel.input_vocabulary = %VOCABULARY
80
+ models.EncoderDecoderModel.label_smoothing = %LABEL_SMOOTHING
81
+ models.EncoderDecoderModel.loss_normalizing_factor = %LOSS_NORMALIZING_FACTOR
82
+ models.EncoderDecoderModel.module = @network.Transformer()
83
+ models.EncoderDecoderModel.optimizer_def = %OPTIMIZER
84
+ models.EncoderDecoderModel.output_vocabulary = %VOCABULARY
85
+ models.EncoderDecoderModel.z_loss = %Z_LOSS
86
+
87
+ # Parameters for partitioning.PjitPartitioner:
88
+ # ==============================================================================
89
+ partitioning.PjitPartitioner.logical_axis_rules = \
90
+ @partitioning.standard_logical_axis_rules()
91
+ partitioning.PjitPartitioner.model_parallel_submesh = None
92
+ partitioning.PjitPartitioner.num_partitions = 1
93
+
94
+ # Parameters for utils.RestoreCheckpointConfig:
95
+ # ==============================================================================
96
+ utils.RestoreCheckpointConfig.path = []
97
+
98
+ # Parameters for utils.SaveCheckpointConfig:
99
+ # ==============================================================================
100
+ utils.SaveCheckpointConfig.dtype = 'float32'
101
+ utils.SaveCheckpointConfig.keep = 4
102
+ utils.SaveCheckpointConfig.period = 50000
103
+ utils.SaveCheckpointConfig.save_dataset = False
104
+ utils.SaveCheckpointConfig.use_gda = False
105
+
106
+ # Parameters for seqio.SentencePieceVocabulary:
107
+ # ==============================================================================
108
+ seqio.SentencePieceVocabulary.sentencepiece_model_file = \
109
+ 'gs://t5-dutch-english/vocabs/nedd.32000.128extra/spiece.model'
110
+
111
+ # Parameters for network.T5Config:
112
+ # ==============================================================================
113
+ network.T5Config.dropout_rate = %DROPOUT_RATE
114
+ network.T5Config.dtype = 'bfloat16'
115
+ network.T5Config.emb_dim = 1024
116
+ network.T5Config.head_dim = 64
117
+ network.T5Config.logits_via_embedding = False
118
+ network.T5Config.mlp_activations = ('gelu', 'linear')
119
+ network.T5Config.mlp_dim = 2816
120
+ network.T5Config.num_decoder_layers = 24
121
+ network.T5Config.num_encoder_layers = 24
122
+ network.T5Config.num_heads = 16
123
+ network.T5Config.vocab_size = 32128
124
+
125
+ # Parameters for train_script.train:
126
+ # ==============================================================================
127
+ train_script.train.checkpoint_cfg = @utils.CheckpointConfig()
128
+ train_script.train.eval_period = 2000
129
+ train_script.train.eval_steps = 20
130
+ train_script.train.infer_eval_dataset_cfg = None
131
+ train_script.train.model = %MODEL
132
+ train_script.train.model_dir = %MODEL_DIR
133
+ train_script.train.partitioner = @partitioning.PjitPartitioner()
134
+ train_script.train.random_seed = %RANDOM_SEED
135
+ train_script.train.stats_period = 100
136
+ train_script.train.summarize_config_fn = @gin_utils.summarize_gin_config
137
+ train_script.train.total_steps = %TRAIN_STEPS
138
+ train_script.train.train_dataset_cfg = @train/utils.DatasetConfig()
139
+ train_script.train.train_eval_dataset_cfg = @train_eval/utils.DatasetConfig()
140
+ train_script.train.trainer_cls = @trainer.Trainer
141
+ train_script.train.use_hardware_rng = %USE_HARDWARE_RNG
142
+
143
+ # Parameters for trainer.Trainer:
144
+ # ==============================================================================
145
+ trainer.Trainer.learning_rate_fn = @utils.create_learning_rate_scheduler()
146
+ trainer.Trainer.num_microbatches = None
147
+
148
+ # Parameters for network.Transformer:
149
+ # ==============================================================================
150
+ network.Transformer.config = @network.T5Config()
config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_name_or_path": "yhavinga/ul2_large_dutch_english",
3
+ "architectures": [
4
+ "T5ForConditionalGeneration"
5
+ ],
6
+ "d_ff": 2816,
7
+ "d_kv": 64,
8
+ "d_model": 1024,
9
+ "decoder_start_token_id": 0,
10
+ "dense_act_fn": "gelu_new",
11
+ "dropout_rate": 0.1,
12
+ "eos_token_id": 1,
13
+ "feed_forward_proj": "gated-gelu",
14
+ "initializer_factor": 1.0,
15
+ "is_encoder_decoder": true,
16
+ "is_gated_act": true,
17
+ "layer_norm_epsilon": 1e-06,
18
+ "model_type": "t5",
19
+ "n_positions": 512,
20
+ "num_decoder_layers": 24,
21
+ "num_heads": 16,
22
+ "num_layers": 24,
23
+ "output_past": true,
24
+ "pad_token_id": 0,
25
+ "relative_attention_max_distance": 128,
26
+ "relative_attention_num_buckets": 32,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "float32",
29
+ "transformers_version": "4.24.0",
30
+ "use_cache": true,
31
+ "vocab_size": 32128
32
+ }
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d4fe142609d6abec1d8c99e2dfe9d118b96b0dd2028283b74f680079da4327e8
3
+ size 3132624407
model-info.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dcd540d858f0dd6f5d186e6920b96b827f6b219994fd26ab84f7ea01f22fd05f
3
+ size 3132785797
special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
spiece.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:caa6e2f21aeec181276ab80273e3f869ce303ccb8602d68e0524783c3581092d
3
+ size 800223
spiece.vocab ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "extra_ids": 100,
106
+ "name_or_path": "yhavinga/ul2-base-en-nl",
107
+ "pad_token": "<pad>",
108
+ "sp_model_kwargs": {},
109
+ "special_tokens_map_file": null,
110
+ "tokenizer_class": "T5Tokenizer",
111
+ "unk_token": "<unk>",
112
+ "use_fast_tokenizer": false
113
+ }
train/events.out.tfevents.1673469304.t1v-n-e0ca6cd3-w-0.414828.0.v2 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:755b6e1773bcea0ee01cec88da90650753ff78839f8b20812bf6a9db7e652621
3
+ size 10189377
train/events.out.tfevents.1673860280.t1v-n-e0ca6cd3-w-0.10042.0.v2 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:709882af3ea630a061dc2ae2dd36943167f026944e971a6619bb1f21c751ccd6
3
+ size 5150450
training_eval/mc4_en_nl_ul2_denoising/events.out.tfevents.1673469304.t1v-n-e0ca6cd3-w-0.414828.1.v2 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9b0c34807fcf7a99b970f935b70654ee7e9737543cdf0888803537900cb0f60a
3
+ size 450261
training_eval/mc4_en_nl_ul2_denoising/events.out.tfevents.1673860280.t1v-n-e0ca6cd3-w-0.10042.1.v2 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5746fd124e9eff6b3cd25fb95d7dab6c25c0dbb288e0c17a975ac68552595329
3
+ size 227037
training_eval/ul2_en_nl_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673469304.t1v-n-e0ca6cd3-w-0.414828.2.v2 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:33fd9234bd0c5fcb3045245f51d5dd87613e3a5c036c41c15b5b683a06118113
3
+ size 450261
training_eval/ul2_en_nl_mc4_nedd_wiki_news_mix_1/events.out.tfevents.1673860280.t1v-n-e0ca6cd3-w-0.10042.2.v2 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6094ab76a010e2e1f9d57d7c8b6f1378d11da15c5df448a33d57402a816340b9
3
+ size 227037