Update README.md
README.md
CHANGED
@@ -33,22 +33,22 @@ widget:
- [Overview](#overview)
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Training](#training)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Evaluation Results](#evaluation-results)
- [Additional Information](#additional-information)
- [Contact Information](#contact-information)
- [Copyright](#copyright)
- [Licensing Information](#licensing-information)
- [Funding](#funding)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
- [Disclaimer](#disclaimer)

</details>

## Overview
@@ -60,6 +60,13 @@ widget:
## Model Description
RoBERTa-large-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) large model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

## Intended Uses and Limitations
You can use the raw model for fill mask or fine-tune it for a downstream task (see the sketch at the end of this section).

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
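As an illustration of the fine-tuning path mentioned above, the sketch below attaches a classification head to the pretrained checkpoint; the Hub id `PlanTL-GOB-ES/roberta-large-bne`, the binary label count, and the use of `AutoModelForSequenceClassification` are assumptions for the example, not details taken from this card.

```python
# Illustrative sketch only: load the pretrained checkpoint with a task head and fine-tune it.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "PlanTL-GOB-ES/roberta-large-bne"  # assumed Hub id for this model
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 stands in for a hypothetical binary classification task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
# From here, train on labeled Spanish data, e.g. with transformers.Trainer.
```

The same pattern applies to other heads (token classification, question answering, and so on).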
## How to Use
You can use this model directly with a pipeline for fill mask. Since the generation relies on some randomness, we set a seed for reproducibility:
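The fill-mask example itself falls outside the hunks shown in this diff. A minimal sketch of what such a seeded pipeline looks like, assuming the Hub id `PlanTL-GOB-ES/roberta-large-bne` and an illustrative Spanish prompt:

```python
# Sketch of a seeded fill-mask pipeline; the Hub id and the prompt are assumptions.
from pprint import pprint
from transformers import pipeline, set_seed

set_seed(42)  # fix the seed so the returned completions are reproducible
unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-large-bne")
pprint(unmasker("Gracias a los datos de la BNE se ha podido <mask> este modelo del lenguaje."))
```

Each prediction is returned as a dict with the filled sequence, the candidate token, and its score.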
@@ -102,12 +109,6 @@ Here is how to use this model to get the features of a given text in PyTorch:
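The body of that PyTorch example is also outside the hunks shown here; a minimal feature-extraction sketch consistent with the `torch.Size([1, 19, 1024])` output below, again assuming the Hub id `PlanTL-GOB-ES/roberta-large-bne` and an illustrative input sentence:

```python
# Sketch of extracting contextual features; the Hub id and the sentence are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "PlanTL-GOB-ES/roberta-large-bne"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "Gracias a los datos de la BNE se ha podido desarrollar este modelo del lenguaje."
encoded_input = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input)

# Shape is (batch, tokens, hidden); hidden size 1024 matches the large architecture,
# and the token count depends on the input (the card's example prints 19).
print(output.last_hidden_state.shape)
```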
torch.Size([1, 19, 1024])
```

## Training

### Training Data
@@ -150,11 +151,24 @@ For more evaluation details visit our [GitHub repository](https://github.com/Pla
## Additional Information

### Contact Information

For further information, send an email to <plantl-gob-es@bsc.es>

### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing Information

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Citation Information

If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```
@article{,
@@ -174,21 +188,9 @@ Intelligence (SEDIA) within the framework of the Plan-TL.},
}
```

### Contributions

[N/A]

### Disclaimer