Upload README.md with huggingface_hub

#1
by lbourdois - opened
Files changed (1)
  1. README.md +84 -86
README.md CHANGED
@@ -1,86 +1,84 @@
- ---
- language:
- - english
- thumbnail:
- tags:
- - token classification
- license: agpl-3.0
- datasets:
- - EMBO/sd-nlp
- metrics:
- -
- ---
-
- # sd-smallmol-roles
-
- ## Model description
-
- This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific textual examples from the life sciences using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the `SMALL_MOL_ROLES` configuration to perform pure context-dependent semantic role classification of bioentities.
-
-
- ## Intended uses & limitations
-
- #### How to use
-
- The intended use of this model is to infer the semantic role of small molecules with regard to the causal hypotheses tested in experiments reported in scientific papers.
-
- To quickly check the model:
-
- ```python
- from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification
- example = """<s>The <mask> overexpression in cells caused an increase in <mask> expression.</s>"""
- tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
- model = RobertaForTokenClassification.from_pretrained('EMBO/sd-smallmol-roles')
- ner = pipeline('ner', model, tokenizer=tokenizer)
- res = ner(example)
- for r in res:
-     print(r['word'], r['entity'])
- ```
-
- #### Limitations and bias
-
- The model must be used with the `roberta-base` tokenizer.
-
- ## Training data
-
- The model was trained for token classification using the [EMBO/sd-nlp dataset](https://huggingface.co/datasets/EMBO/sd-nlp), which includes manually annotated examples.
-
- ## Training procedure
-
- Training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.
-
- Training code is available at https://github.com/source-data/soda-roberta
-
- - Model fine-tuned: EMBL/bio-lm
- - Tokenizer vocab size: 50265
- - Training data: EMBO/sd-nlp
- - Dataset configuration: SMALL_MOL_ROLES
- - Training with 48771 examples.
- - Evaluating on 13801 examples.
- - Training on 15 features: O, I-CONTROLLED_VAR, B-CONTROLLED_VAR, I-MEASURED_VAR, B-MEASURED_VAR
- - Epochs: 0.33
- - `per_device_train_batch_size`: 16
- - `per_device_eval_batch_size`: 16
- - `learning_rate`: 0.0001
- - `weight_decay`: 0.0
- - `adam_beta1`: 0.9
- - `adam_beta2`: 0.999
- - `adam_epsilon`: 1e-08
- - `max_grad_norm`: 1.0
-
- ## Eval results
-
- On 7178 examples of the test set with `sklearn.metrics`:
-
- ```
-                 precision    recall  f1-score   support
-
- CONTROLLED_VAR       0.76      0.90      0.83      2946
-   MEASURED_VAR       0.60      0.71      0.65       852
-
-      micro avg       0.73      0.86      0.79      3798
-      macro avg       0.68      0.80      0.74      3798
-   weighted avg       0.73      0.86      0.79      3798
-
- {'test_loss': 0.011743436567485332, 'test_accuracy_score': 0.9951612532624371, 'test_precision': 0.7261345852895149, 'test_recall': 0.8551869404949973, 'test_f1': 0.7853947527505744, 'test_runtime': 58.0378, 'test_samples_per_second': 123.678, 'test_steps_per_second': 1.947}
- ```
 
+ ---
+ language: en
+ license: agpl-3.0
+ tags:
+ - token classification
+ datasets:
+ - EMBO/sd-nlp
+ metrics: []
+ ---
+
+
+ # sd-smallmol-roles
+
+ ## Model description
+
+ This model is a [RoBERTa base model](https://huggingface.co/roberta-base) that was further trained using a masked language modeling task on a compendium of English scientific textual examples from the life sciences using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang). It was then fine-tuned for token classification on the SourceData [sd-nlp](https://huggingface.co/datasets/EMBO/sd-nlp) dataset with the `SMALL_MOL_ROLES` configuration to perform pure context-dependent semantic role classification of bioentities.
+
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ The intended use of this model is to infer the semantic role of small molecules with regard to the causal hypotheses tested in experiments reported in scientific papers.
+
+ To quickly check the model:
+
+ ```python
+ from transformers import pipeline, RobertaTokenizerFast, RobertaForTokenClassification
+ example = """<s>The <mask> overexpression in cells caused an increase in <mask> expression.</s>"""
+ tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_len=512)
+ model = RobertaForTokenClassification.from_pretrained('EMBO/sd-smallmol-roles')
+ ner = pipeline('ner', model, tokenizer=tokenizer)
+ res = ner(example)
+ for r in res:
+     print(r['word'], r['entity'])
+ ```
+
+ #### Limitations and bias
+
+ The model must be used with the `roberta-base` tokenizer.
+
+ ## Training data
+
+ The model was trained for token classification using the [EMBO/sd-nlp dataset](https://huggingface.co/datasets/EMBO/sd-nlp), which includes manually annotated examples.
+
+ ## Training procedure
+
+ Training was run on an NVIDIA DGX Station with 4x Tesla V100 GPUs.
+
+ Training code is available at https://github.com/source-data/soda-roberta
+
+ - Model fine-tuned: EMBL/bio-lm
+ - Tokenizer vocab size: 50265
+ - Training data: EMBO/sd-nlp
+ - Dataset configuration: SMALL_MOL_ROLES
+ - Training with 48771 examples.
+ - Evaluating on 13801 examples.
+ - Training on 15 features: O, I-CONTROLLED_VAR, B-CONTROLLED_VAR, I-MEASURED_VAR, B-MEASURED_VAR
+ - Epochs: 0.33
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `learning_rate`: 0.0001
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+
+ ## Eval results
+
+ On 7178 examples of the test set with `sklearn.metrics`:
+
+ ```
+                 precision    recall  f1-score   support
+
+ CONTROLLED_VAR       0.76      0.90      0.83      2946
+   MEASURED_VAR       0.60      0.71      0.65       852
+
+      micro avg       0.73      0.86      0.79      3798
+      macro avg       0.68      0.80      0.74      3798
+   weighted avg       0.73      0.86      0.79      3798
+
+ {'test_loss': 0.011743436567485332, 'test_accuracy_score': 0.9951612532624371, 'test_precision': 0.7261345852895149, 'test_recall': 0.8551869404949973, 'test_f1': 0.7853947527505744, 'test_runtime': 58.0378, 'test_samples_per_second': 123.678, 'test_steps_per_second': 1.947}
+ ```
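
As a quick sanity check on the evaluation figures in the README, the sketch below recomputes F1 as the harmonic mean of the reported `test_precision` and `test_recall`, and sums the per-class support from the classification report. The helper name `f1_from` is introduced here only for illustration and is not part of the README or the training code. The sketch assumes the reported `test_f1` is the standard harmonic mean of the two reported values.

```python
# Values copied from the `test_*` dictionary reported in the README.
test_precision = 0.7261345852895149
test_recall = 0.8551869404949973
reported_f1 = 0.7853947527505744

# Per-class support values copied from the classification report.
support = {"CONTROLLED_VAR": 2946, "MEASURED_VAR": 852}
reported_total_support = 3798


def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero).

    Hypothetical helper for this check; not taken from the training code.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recomputed_f1 = f1_from(test_precision, test_recall)
total_support = sum(support.values())

print(f"recomputed F1: {recomputed_f1:.4f} (reported: {reported_f1:.4f})")
print(f"total support: {total_support} (reported: {reported_total_support})")
```

If the assumption holds, the recomputed F1 should agree with `test_f1` up to floating-point rounding, and the summed support should match the total in the `micro avg` row.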