gonzalez-agirre committed
Commit 1947df0 · 1 Parent(s): dfc8775

Update README.md

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
 - "catalan"
 - "masked-lm"
 - "longformer"
- - "longformer-base-4096-ca"
+ - "longformer-base-4096-ca-v2"
 - "CaText"
 - "Catalan Textual Corpus"
 
@@ -22,7 +22,7 @@ widget:
 
 ---
 
- # Catalan Longformer (longformer-base-4096-ca) base model
+ # Catalan Longformer (longformer-base-4096-ca-v2) base model
 
 ## Table of Contents
 <details>
@@ -52,13 +52,13 @@ widget:
 
 ## Model description
 
- The **longformer-base-4096-ca** is the [Longformer](https://huggingface.co/allenai/longformer-base-4096) version of the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) masked language model for the Catalan language. Using this Longformer architecture we can process contexts of up to 4096 tokens without the need of additional aggregation strategies. The pretraining process of this model started from the **roberta-base-ca-v2** checkpoint and was pretrained for MLM on both short and long documents in Catalan.
+ The **longformer-base-4096-ca-v2** model is the [Longformer](https://huggingface.co/allenai/longformer-base-4096) version of the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) masked language model for the Catalan language. The Longformer architecture allows the model to process larger contexts (up to 4096 tokens) as input without the need for additional aggregation strategies. Pretraining started from the **roberta-base-ca-v2** checkpoint and continued with MLM training on both short and long documents in Catalan.
 
 The Longformer model uses a combination of sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations. Please refer to the original [paper](https://arxiv.org/abs/2004.05150) for more details on how to set global attention.
 
 ## Intended uses and limitations
 
- The **longformer-base-4096-ca** model is ready-to-use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
+ The **longformer-base-4096-ca-v2** model is ready to use only for masked language modeling to perform the Fill Mask task (try the inference API or read the next section).
 However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
 
 ## How to use
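To make the global-attention configuration described in the model description concrete, here is a minimal sketch that marks only the first token for global attention, a common default for classification-style tasks. It assumes the standard `transformers` Longformer interface, where `global_attention_mask` selects the globally attending tokens; which tokens should attend globally is task-dependent and not specified by the card.

```python
# Minimal sketch: local (sliding-window) attention everywhere, global attention
# on the first token only. The choice of globally attending tokens is task-specific.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "projecte-aina/longformer-base-4096-ca-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("El Parlament de Catalunya es va reunir ahir.", return_tensors="pt")

global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # 1 = global attention, 0 = local attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```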
@@ -69,8 +69,8 @@ Here is how to use this model:
 from transformers import AutoModelForMaskedLM
 from transformers import AutoTokenizer, FillMaskPipeline
 from pprint import pprint
- tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/longformer-base-4096-ca')
- model = AutoModelForMaskedLM.from_pretrained('projecte-aina/longformer-base-4096-ca')
+ tokenizer_hf = AutoTokenizer.from_pretrained('projecte-aina/longformer-base-4096-ca-v2')
+ model = AutoModelForMaskedLM.from_pretrained('projecte-aina/longformer-base-4096-ca-v2')
 model.eval()
 pipeline = FillMaskPipeline(model, tokenizer_hf)
 text = f"Em dic <mask>."
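The same Fill Mask behaviour can also be reached through the high-level `pipeline` factory. This is a self-contained sketch rather than an excerpt from the card; it only assumes the checkpoint name used in the hunk above.

```python
# Fill-mask sketch using the high-level pipeline API.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="projecte-aina/longformer-base-4096-ca-v2")
for pred in unmasker("Em dic <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```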
@@ -106,7 +106,7 @@ The training corpus consists of several corpora gathered from web crawling and p
 | Vilaweb | 0.06 |
 | Tweets | 0.02 |
 
- For this specific pre-training process we have performed an undersampling process to obtain a corpus of 5,3 GB.
+ For this specific pre-training process, we performed undersampling to obtain a corpus of 5.3 GB.
 
 ### Training procedure
 
@@ -117,7 +117,7 @@ The training corpus has been tokenized using a byte version of Byte-Pair Encodin
 
 ### CLUB benchmark
 
- The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
+ The **longformer-base-4096-ca-v2** model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
 which was created along with the model.
 
 It contains the following tasks and their related datasets:
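The hunk context above mentions that the training corpus was tokenized with a byte-level version of Byte-Pair Encoding. A quick way to look at that tokenizer is sketched below; it assumes the tokenizer files are published alongside the model weights, as is usual for Hugging Face checkpoints.

```python
# Sketch: inspect the byte-level BPE tokenizer shipped with the checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("projecte-aina/longformer-base-4096-ca-v2")
print(len(tok))                                         # vocabulary size
print(tok.model_max_length)                             # maximum input length
print(tok.tokenize("El pa amb tomàquet és boníssim."))  # byte-level BPE subwords
```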
@@ -163,7 +163,7 @@ After fine-tuning the model on the downstream tasks, it achieved the following p
 | ------------|:-------------:| -----:|:------|:------|:-------|:------|:----|:----|:----|
 | RoBERTa-large-ca-v2 | **89.82** | **99.02** | **83.41** | **75.46** | 83.61 | **89.34/75.50** | **89.20**/75.77 | **90.72/79.06** | **73.79**/55.34 |
 | RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 87.74/72.58 | 88.72/**75.91** | 89.50/76.63 | 73.64/**55.42** |
- | Longformer-base-4096-ca | 88.49 | 98.98 | 78.37 | 73.79 | **83.89** | 87.59/72.33 | 88.70/**76.05** | 89.33/77.03 | 73.09/54.83 |
+ | Longformer-base-4096-ca-v2 | 88.49 | 98.98 | 78.37 | 73.79 | **83.89** | 87.59/72.33 | 88.70/**76.05** | 89.33/77.03 | 73.09/54.83 |
 | BERTa | 89.76 | 98.96 | 80.19 | 73.65 | 79.26 | 85.93/70.58 | 87.12/73.11 | 89.17/77.14 | 69.20/51.47 |
 | mBERT | 86.87 | 98.83 | 74.26 | 69.90 | 74.63 | 82.78/67.33 | 86.89/73.53 | 86.90/74.19 | 68.79/50.80 |
 | XLM-RoBERTa | 86.31 | 98.89 | 61.61 | 70.14 | 33.30 | 86.29/71.83 | 86.88/73.11 | 88.17/75.93 | 72.55/54.16 |
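Since the card positions the model for fine-tuning on non-generative downstream tasks such as the CLUB ones above, a fine-tuning sketch for a sequence-classification task follows. The dataset identifier, label count, and hyperparameters are placeholders for illustration only and are not taken from the card or from the CLUB setup.

```python
# Illustrative fine-tuning sketch for text classification. Placeholder pieces:
# the dataset identifier, num_labels, and all hyperparameters.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "projecte-aina/longformer-base-4096-ca-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=4)

# Placeholder dataset with "text" and "label" columns; swap in the actual task data.
dataset = load_dataset("path/to/catalan-classification-dataset")

def tokenize(batch):
    # Long documents are the point of this model; cap inputs at the 4096-token window.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="longformer-ca-v2-finetuned",
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=2,   # long sequences are memory-hungry
    gradient_accumulation_steps=8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # the default collator then pads batches dynamically
)
trainer.train()
```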