update readme
README.md
CHANGED
@@ -40,14 +40,14 @@ The training corpus is composed of several biomedical corpora in Spanish, collec

| Name | No. tokens | Description |
|------|------------|-------------|
-| [Medical crawler](https://zenodo.org/record/4561971#.YTtwM32xXbQ) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish health
-| Scielo
-| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines
-| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles
-| Patents | 13,463,387 | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C", "A61F", "A61H", "A61K", "A61L", "A61M", "A61B", "A61P"
-| [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from
-| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side
-| PubMed | 1,858,966 |

To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:
|
@@ -84,7 +84,7 @@ The evaluation results are compared against the [mBERT](https://huggingface.co/b

The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)

-However, the is intended to be fine-tuned on

---
| Name | No. tokens | Description |
|------|------------|-------------|
+| [Medical crawler](https://zenodo.org/record/4561971#.YTtwM32xXbQ) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
+| [Scielo](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
+| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines. |
+| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category, crawled on 04/01/2021. |
+| Patents | 13,463,387 | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for JSON files of patents are: "A61B", "A61C", "A61F", "A61H", "A61K", "A61L", "A61M", "A61B", "A61P". |
+| [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency. |
+| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpora consisting of biomedical scientific literature, aggregated from the MedlinePlus source. |
+| PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |

To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:
The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)

+However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.

---
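As a sketch of the ready-to-use Fill Mask capability mentioned above, the snippet below queries the model through the `transformers` fill-mask pipeline. The model id and the example sentence are assumptions (this README chunk does not name the repository; substitute the actual model id):

```python
# Minimal fill-mask sketch. MODEL_ID is an assumption -- replace it with
# the actual Hugging Face repository name of this model.
from transformers import pipeline

MODEL_ID = "PlanTL-GOB-ES/roberta-base-biomedical-es"  # assumed model id

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# RoBERTa-style models use "<mask>" as the mask token.
predictions = fill_mask("El único antecedente personal a reseñar era la <mask> arterial.")
for pred in predictions:
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```

Each prediction is a dict with `token_str`, `score`, and the filled-in `sequence`, sorted by score.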