gonzalez-agirre committed
Commit 1947df0 · 1 Parent(s): dfc8775
Update README.md
README.md CHANGED
```diff
@@ -6,7 +6,7 @@ tags:
 - "catalan"
 - "masked-lm"
 - "longformer"
-- "longformer-base-4096-ca"
+- "longformer-base-4096-ca-v2"
 - "CaText"
 - "Catalan Textual Corpus"
 
@@ -22,7 +22,7 @@ widget:
 
 ---
 
-# Catalan Longformer (longformer-base-4096-ca) base model
+# Catalan Longformer (longformer-base-4096-ca-v2) base model
 
 ## Table of Contents
 <details>
@@ -52,13 +52,13 @@ widget:
 
 ## Model description
 
-The **longformer-base-4096-ca** is the [Longformer](https://huggingface.co/allenai/longformer-base-4096) version of the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) masked language model for the Catalan language.
+The **longformer-base-4096-ca-v2** is the [Longformer](https://huggingface.co/allenai/longformer-base-4096) version of the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) masked language model for the Catalan language. The use of these models allows us to process larger contexts (up to 4096 tokens) as input without the need of additional aggregation strategies. The pretraining process of this model started from the **roberta-base-ca-v2** checkpoint and was pretrained for MLM on both short and long documents in Catalan.
 
 The Longformer model uses a combination of sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations. Please refer to the original [paper](https://arxiv.org/abs/2004.05150) for more details on how to set global attention.
 
 ## Intended uses and limitations
 
-The **longformer-base-4096-ca** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
+The **longformer-base-4096-ca-v2** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
 However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
 
 ## How to use
```
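The model description above says that global attention is configured by the user per task. As a hedged sketch only (not part of this README; it assumes the renamed `projecte-aina/longformer-base-4096-ca-v2` checkpoint and the standard `global_attention_mask` argument that `transformers` exposes for Longformer models), marking the first token as global could look like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name taken from the renamed repo in this commit.
model_id = "projecte-aina/longformer-base-4096-ca-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Illustrative Catalan sentence, not taken from the model card.
text = "El Parlament de Catalunya es troba a Barcelona."
inputs = tokenizer(text, return_tensors="pt")

# Longformer uses sliding-window (local) attention by default; tokens flagged
# with 1 here additionally attend globally. Marking only the first token (<s>)
# is a common choice for classification-style tasks.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```

Putting global attention on the first token is a common default for classification-style tasks; question answering setups typically also mark the question tokens as global, as described in the Longformer paper linked above.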
```diff
@@ -69,8 +69,8 @@ Here is how to use this model:
 from transformers import AutoModelForMaskedLM
 from transformers import AutoTokenizer, FillMaskPipeline
 from pprint import pprint
-tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/longformer-base-4096-ca')
-model = AutoModelForMaskedLM.from_pretrained('projecte-aina/longformer-base-4096-ca')
+tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/longformer-base-4096-ca-v2')
+model = AutoModelForMaskedLM.from_pretrained('projecte-aina/longformer-base-4096-ca-v2')
 model.eval()
 pipeline = FillMaskPipeline(model, tokenizer_hf)
 text = f"Em dic <mask>."
```
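The hunk above only shows the changed lines of the README's snippet and stops before the pipeline is actually run. A self-contained version, with an assumed final call that runs the pipeline and prints the predictions (the README's own closing lines are not visible in this diff, so they are a guess), would be:

```python
from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Renamed checkpoint introduced by this commit.
model_id = "projecte-aina/longformer-base-4096-ca-v2"

tokenizer_hf = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "Em dic <mask>."

# Each prediction carries the filled sequence, the predicted token and its score.
results = pipeline(text)
pprint([(r["token_str"], round(r["score"], 4)) for r in results])
```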
```diff
@@ -106,7 +106,7 @@ The training corpus consists of several corpora gathered from web crawling and p
 | Vilaweb | 0.06 |
 | Tweets | 0.02 |
 
-For this specific pre-training process we have performed an undersampling process to obtain a corpus of 5,3 GB.
+For this specific pre-training process, we have performed an undersampling process to obtain a corpus of 5,3 GB.
 
 ### Training procedure
 
@@ -117,7 +117,7 @@ The training corpus has been tokenized using a byte version of Byte-Pair Encodin
 
 ### CLUB benchmark
 
-The **longformer-base-4096-ca** model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
+The **longformer-base-4096-ca-v2** model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
 that has been created along with the model.
 
 It contains the following tasks and their related datasets:
@@ -163,7 +163,7 @@ After fine-tuning the model on the downstream tasks, it achieved the following p
 | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
 | RoBERTa-large-ca-v2 | **89.82** | **99.02** | **83.41** | **75.46** | 83.61 | **89.34/75.50** | **89.20**/75.77 | **90.72/79.06** | **73.79**/55.34 |
 | RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 87.74/72.58 | 88.72/**75.91** | 89.50/76.63 | 73.64/**55.42** |
-| Longformer-base-4096-ca | 88.49 | 98.98 | 78.37 | 73.79 | **83.89** | 87.59/72.33 | 88.70/**76.05** | 89.33/77.03 | 73.09/54.83 |
+| Longformer-base-4096-ca-v2 | 88.49 | 98.98 | 78.37 | 73.79 | **83.89** | 87.59/72.33 | 88.70/**76.05** | 89.33/77.03 | 73.09/54.83 |
 | BERTa | 89.76 | 98.96 | 80.19 | 73.65 | 79.26 | 85.93/70.58 | 87.12/73.11 | 89.17/77.14 | 69.20/51.47 |
 | mBERT | 86.87 | 98.83 | 74.26 | 69.90 | 74.63 | 82.78/67.33 | 86.89/73.53 | 86.90/74.19 | 68.79/50.80 |
 | XLM-RoBERTa | 86.31 | 98.89 | 61.61 | 70.14 | 33.30 | 86.29/71.83 | 86.88/73.11 | 88.17/75.93 | 72.55/54.16 |
```
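The card's intended-use note and the CLUB results above both refer to fine-tuning the model on non-generative downstream tasks. As a minimal sketch, assuming the renamed checkpoint and a generic `transformers` token-classification head rather than the authors' actual training setup, a NER fine-tuning run could start from:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "projecte-aina/longformer-base-4096-ca-v2"  # renamed checkpoint from this commit
num_labels = 9  # assumption: size of a BIO tag set for some Catalan NER dataset

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=num_labels)

# From here, a standard Trainer or training loop over a tokenized Catalan NER
# dataset would be used; hyperparameters are not specified in this commit.
print(model.config.num_labels)
```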