tlemberger committed on
Commit
3d96e5f
1 Parent(s): 0b28f00
Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -15,14 +15,14 @@ metrics:
 
  ## Model description
 
- This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of english scientific textual examples from the life sciences using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset wit the `NER` task to perform Named Entity Recognition of bioentities.
+ This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific textual examples from the life sciences using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the `NER` configuration to perform Named Entity Recognition of bioentities.
 
 
  ## Intended uses & limitations
 
  #### How to use
 
- The intended use of this model is for Named Entity Recognition of biological entitie used in SourceData annotations (https://sourcedata.embo.org), including small molecules, gene products (genes and proteins), subcellular components, cell line and cell types, organ and tissues, species as well as experimental methods.
+ The intended use of this model is for Named Entity Recognition of biological entities used in SourceData annotations (https://sourcedata.embo.org), including small molecules, gene products (genes and proteins), subcellular components, cell line and cell types, organ and tissues, species as well as experimental methods.
 
  To have a quick check of the model:
 
@@ -51,8 +51,10 @@ The training was run on a NVIDIA DGX Station with 4XTesla V100 GPUs.
 
  Training code is available at https://github.com/source-data/soda-roberta
 
+ - Model fine-tuned: EMBO/bio-lm
  - Tokenizer vocab size: 50265
- - Training data: EMBO/sd-nlp NER
+ - Training data: EMBO/sd-nlp
+ - Dataset configuration: NER
  - Training with 48771 examples.
  - Evaluating on 13801 examples.
  - Training on 15 features: O, I-SMALL_MOLECULE, B-SMALL_MOLECULE, I-GENEPROD, B-GENEPROD, I-SUBCELLULAR, B-SUBCELLULAR, I-CELL, B-CELL, I-TISSUE, B-TISSUE, I-ORGANISM, B-ORGANISM, I-EXP_ASSAY, B-EXP_ASSAY
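
The 15 features listed in the README follow the usual `B-`/`I-`/`O` tagging scheme for the seven SourceData entity types. As a rough illustration of how such tags are typically consumed, the sketch below builds label/id mappings and merges tagged tokens into entity spans. It is not taken from the soda-roberta training code: the helper `group_entities` is hypothetical, the id order is only assumed to follow the order in the list (the model's `config.json` is the authority), and joining tokens with spaces ignores the byte-level BPE subwords a RoBERTa tokenizer actually produces.

```python
# Label order copied from the "Training on 15 features" line of the README.
# Assumption: ids follow this order; check the model's config.json to confirm.
LABELS = [
    "O",
    "I-SMALL_MOLECULE", "B-SMALL_MOLECULE",
    "I-GENEPROD", "B-GENEPROD",
    "I-SUBCELLULAR", "B-SUBCELLULAR",
    "I-CELL", "B-CELL",
    "I-TISSUE", "B-TISSUE",
    "I-ORGANISM", "B-ORGANISM",
    "I-EXP_ASSAY", "B-EXP_ASSAY",
]

id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}


def group_entities(tokens, tags):
    """Merge B-/I-/O tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration; word-level tokens are assumed.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # Start a new span; a stray I- tag is treated as a span start.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continue the span that is currently open.
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


if __name__ == "__main__":
    tokens = ["HeLa", "cells", "treated", "with", "rapamycin"]
    tags = ["B-CELL", "I-CELL", "O", "O", "B-SMALL_MOLECULE"]
    print(group_entities(tokens, tags))
```

With real model output, the tags would come from mapping predicted class ids through `id2label`, and subword pieces would need to be merged back into words before grouping.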