Updated README
README.md
CHANGED
@@ -1,15 +1,31 @@
|
|
1 |
# North-T5
|
2 |
North-T5 is a set of Norwegian sequence-to-sequence models. It builds on the flexible T5 text-to-text platform and can be used for a variety of NLP tasks, ranging from classification to translation.
|
3 |
|
4 |
|
5 |
-
## Main versions - download
|
6 |
-
|**Model:** | **Parameters** |**Transformers** |**T5X** |
|
7 |
-
|:-----------|:------------|:------------|:------------|
|
8 |
-
|North-T5-small|60 million | HuggingFace | GCloud Bucket |
|
9 |
-
|North-T5-base|220 million | HuggingFace | GCloud Bucket |
|
10 |
-
|North-T5-large|770 million | HuggingFace | GCloud Bucket |
|
11 |
-
|North-T5-xl|3 billion | HuggingFace | GCloud Bucket |
|
12 |
-
|North-T5-xxl|11 billion| N/A | GCloud Bucket |
|
13 |
|
14 |
## Performance
|
15 |
A thorough evaluation of the North-T5 models is planned, and I strongly recommend that external researchers run their own evaluations. The main advantage of the T5 models is their flexibility. Traditionally, encoder-only models (like BERT) excel in classification tasks, while seq-2-seq models are easier to train for tasks like translation and Q&A. Despite this, here are the results from using North-T5 on the political classification task explained [here](https://arxiv.org/abs/2104.09617).
|
@@ -28,15 +44,14 @@ A thorough evaluation of the North-T5 models is planned. I strongly recommend an
|
|
28 |
|
29 |
These are preliminary results. The [results](https://arxiv.org/abs/2104.09617) from the BERT models are the test results from the best model after 10 runs with early stopping and a decaying learning rate. The T5 results are the average of five runs on the evaluation set. The small model was trained for 10,000 steps, while the rest were trained for 5,000 steps. A fixed learning rate was used (no decay), with no early stopping, and the recommended rank classification was not used either. We use a max sequence length of 512. This method simplifies the test setup and gives results that are easy to interpret. However, the results from the T5 models might be somewhat sub-optimal.
|
30 |
|
31 |
-
## Sub-versions of North-T5
|
32 |
-
|
33 |
|
34 |
-
|**Model:** | **Description** |
|
35 |
-
|:-----------|:------------|
|
36 |
-
|North-T5-base-LM |Pretrained for an additional 100k steps on the LM objective described in Raffel et al. In a way, this turns a masked language model into an autoregressive model. It also prepares the model for some tasks. For translation and NLI, for instance, it is well documented that a step of unsupervised LM training before fine-tuning gives a clear benefit.|
|
37 |
-
|North-byT5-base | A vocabulary-free version of T5. Trained exactly like North-T5, but instead of using the 200,000-token vocabulary, this model operates directly on the raw text. The model architecture might be of particular interest for tasks such as spelling correction, OCR cleaning and handwriting recognition. However, it will, by design, have a shorter maximum sequence length.|
|
38 |
-
|North-T5-base-modern | Pretrained for an additional 200k steps on a balanced Bokmål and Nynorsk corpus. While originally made for translation between Bokmål and Nynorsk, it might also give improved results on tasks where you know that the input/output is modern "standard" text. A significant part of the training corpus is newspapers and reports.|
|
39 |
-
|North-T5-base-scandinavian |Pretrained for an additional 200k steps on a corpus of the Scandinavian languages (Bokmål, Nynorsk, Danish, Swedish and Icelandic, plus a tiny bit of Faroese). The model was trained to better understand what effect such training has on various languages.|
|
40 |
|
41 |
## Fine-tuned versions
|
42 |
As explained below, the model really needs to be fine-tuned for specific tasks. This procedure is simple, and the model is not very sensitive to the hyper-parameters used. Usually a decent result can be obtained by using a fixed learning rate of 1e-3. Smaller versions of the model typically need to be trained for a longer time. It is easy to train the base models in Google Colab. I will provide an example notebook for this soon.
|
|
|
1 |
+
---
|
2 |
+
language:
|
3 |
+
- no
|
4 |
+
- nn
|
5 |
+
- sv
|
6 |
+
- da
|
7 |
+
- is
|
8 |
+
- en
|
9 |
+
|
10 |
+
datasets:
|
11 |
+
- nbailab/NCC
|
12 |
+
- mc4
|
13 |
+
- wikipedia
|
14 |
+
|
15 |
+
license: apache-2.0
|
16 |
+
---
|
17 |
+
|
18 |
# North-T5
|
19 |
North-T5 is a set of Norwegian sequence-to-sequence models. It builds on the flexible T5 text-to-text platform and can be used for a variety of NLP tasks, ranging from classification to translation.
|
20 |
|
21 |
+
| |**Small** <br />_60M_|**Base** <br />_220M_|**Large** <br />_770M_|**XL** <br />_3B_|**XXL** <br />_11B_|
|
22 |
+
|:-----------|:------------:|:------------:|:------------:|:------------:|:------------:|
|
23 |
+
|North-T5‑NCC|[🤗](https://huggingface.co/north/t5_small_NCC)|[🤗](https://huggingface.co/north/t5_base_NCC)|[🤗](https://huggingface.co/north/t5_large_NCC)|[🤗](https://huggingface.co/north/t5_xl_NCC)|[🤗](https://huggingface.co/north/t5_xxl_NCC)|
|
24 |
+
|North-T5‑NCC‑lm|[🤗](https://huggingface.co/north/t5_small_NCC_lm)|[🤗](https://huggingface.co/north/t5_base_NCC_lm)|✔|[🤗](https://huggingface.co/north/t5_xl_NCC_lm)|[🤗](https://huggingface.co/north/t5_xxl_NCC_lm)|
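
The Hugging Face links in the table follow a regular naming pattern, so a checkpoint can be selected by size. The following is a minimal sketch, assuming the `transformers` library is installed. `repo_id` and `load_north_t5` are hypothetical helper names, and the id for the large `lm` variant (marked ✔ rather than linked in the table) is inferred from the pattern, not confirmed by the table.

```python
# Sizes listed in the table above.
SIZES = ("small", "base", "large", "xl", "xxl")


def repo_id(size: str, lm: bool = False) -> str:
    """Build a Hub repository id following the pattern used in the table.

    Assumption: the large "lm" variant follows the same pattern as the
    linked ones; the table marks it with a check mark instead of a link.
    """
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    suffix = "_lm" if lm else ""
    return f"north/t5_{size}_NCC{suffix}"


def load_north_t5(size: str = "base", lm: bool = False):
    """Download and return (tokenizer, model) for the chosen checkpoint."""
    # Imported lazily so repo_id() can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = repo_id(size, lm)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    return tokenizer, model
```

Calling `load_north_t5("base")` downloads the weights on first use, so the larger sizes may need considerable disk space and memory.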
|
25 |
+
|
26 |
+
## T5X Checkpoint
|
27 |
+
The original T5X checkpoint is also available for this model in the [Google Cloud Bucket](gs://north-t5x/pretrained_models/large/norwegian_NCC_plus_English_pluss100k_lm_t5x_large/).
|
28 |
|
29 |
|
30 |
## Performance
|
31 |
A thorough evaluation of the North-T5 models is planned, and I strongly recommend that external researchers run their own evaluations. The main advantage of the T5 models is their flexibility. Traditionally, encoder-only models (like BERT) excel in classification tasks, while seq-2-seq models are easier to train for tasks like translation and Q&A. Despite this, here are the results from using North-T5 on the political classification task explained [here](https://arxiv.org/abs/2104.09617).
|
|
|
44 |
|
45 |
These are preliminary results. The [results](https://arxiv.org/abs/2104.09617) from the BERT models are the test results from the best model after 10 runs with early stopping and a decaying learning rate. The T5 results are the average of five runs on the evaluation set. The small model was trained for 10,000 steps, while the rest were trained for 5,000 steps. A fixed learning rate was used (no decay), with no early stopping, and the recommended rank classification was not used either. We use a max sequence length of 512. This method simplifies the test setup and gives results that are easy to interpret. However, the results from the T5 models might be somewhat sub-optimal.
|
46 |
|
47 |
+
## Sub-versions of North-T5
|
48 |
+
The following sub-versions are available. Other versions will be available shortly.
|
49 |
+
|
50 |
+
|**Model** | **Description** |
|
51 |
+
|:-----------|:-------|
|
52 |
+
|**North‑T5‑NCC** |This is the main version. It is trained for an additional 500,000 steps starting from the mT5 checkpoint. The training corpus is based on [the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC), with added data from mC4 and English Wikipedia.|
|
53 |
+
|**North‑T5‑NCC‑lm**|Pretrained for an additional 100k steps on the LM objective discussed in the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf). In a way, this turns a masked language model into an autoregressive model. It also prepares the model for some tasks. For translation and NLI, for instance, it is well documented that a step of unsupervised LM training before fine-tuning gives a clear benefit.|
|
54 |
|
55 |
|
56 |
## Fine-tuned versions
|
57 |
As explained below, the model really needs to be fine-tuned for specific tasks. This procedure is simple, and the model is not very sensitive to the hyper-parameters used. Usually a decent result can be obtained by using a fixed learning rate of 1e-3. Smaller versions of the model typically need to be trained for a longer time. It is easy to train the base models in Google Colab. I will provide an example notebook for this soon.
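
The fine-tuning setup described above can be written down as a small configuration. This is a minimal sketch: the key names mirror fields of `transformers.Seq2SeqTrainingArguments`, which is an assumption about the training stack since the README does not say which trainer was used. The step counts are borrowed from the preliminary evaluation in the Performance section and are not a general recommendation. `finetune_config` is a hypothetical helper.

```python
def finetune_config(size: str) -> dict:
    """Return fine-tuning hyper-parameters for a given model size.

    The keys match field names of transformers.Seq2SeqTrainingArguments
    (assumed trainer; not stated in the README).
    """
    return {
        # Fixed learning rate of 1e-3, as suggested in the text.
        "learning_rate": 1e-3,
        # "constant" keeps the rate fixed (no decay).
        "lr_scheduler_type": "constant",
        # Smaller models are trained longer; values taken from the
        # preliminary evaluation (10,000 steps for small, 5,000 otherwise).
        "max_steps": 10_000 if size == "small" else 5_000,
    }


if __name__ == "__main__":
    for size in ("small", "base"):
        print(size, finetune_config(size))
```

With `transformers` installed, the dictionary could be passed on as `Seq2SeqTrainingArguments(output_dir="out", **finetune_config("base"))`, where `output_dir` is a placeholder path.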
|