sschet committed on
Commit
d2f8f8e
·
1 Parent(s): 259ddbc

Upload 7 files

README.md ADDED
@@ -0,0 +1,161 @@
1
+ ---
2
+ language: "en"
3
+ ---
4
+
5
+ # SciBERT finetuned on JNLPBA for NER downstream task
6
+ ## Language Model
7
+ [SciBERT](https://arxiv.org/pdf/1903.10676.pdf) is a pretrained language model based on BERT and trained by the
8
+ [Allen Institute for AI](https://allenai.org/) on papers from the corpus of
9
+ [Semantic Scholar](https://www.semanticscholar.org/).
10
+ Corpus size is 1.14M papers, 3.1B tokens. SciBERT has its own vocabulary (scivocab) that's built to best match
11
+ the training corpus.
12
+
13
+ ## Downstream task
14
+ [`allenai/scibert_scivocab_cased`](https://huggingface.co/allenai/scibert_scivocab_cased#) has been finetuned for the Named Entity
15
+ Recognition (NER) downstream task. The training code can be found [here](https://github.com/fran-martinez/bio_ner_bert).
16
+
17
+ ### Data
18
+ The corpus used to fine-tune the NER model is the [BioNLP / JNLPBA shared task](http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004).
19
+
20
+ - Training data consist of 2,000 PubMed abstracts with term/word annotation. This corresponds to 18,546 samples (sentences).
21
+ - Evaluation data consist of 404 PubMed abstracts with term/word annotation. This corresponds to 3,856 samples (sentences).
22
+
23
+ The classes (at word level) and their distribution (number of examples per class) in the training and evaluation datasets are shown below:
24
+
25
+ | Class Label | # training examples| # evaluation examples|
26
+ |:--------------|--------------:|----------------:|
27
+ |O | 382,963 | 81,647 |
28
+ |B-protein | 30,269 | 5,067 |
29
+ |I-protein | 24,848 | 4,774 |
30
+ |B-cell_type | 6,718 | 1,921 |
31
+ |I-cell_type | 8,748 | 2,991 |
32
+ |B-DNA | 9,533 | 1,056 |
33
+ |I-DNA | 15,774 | 1,789 |
34
+ |B-cell_line | 3,830 | 500 |
35
+ |I-cell_line | 7,387 | 989 |
36
+ |B-RNA | 951 | 118 |
37
+ |I-RNA | 1,530 | 187 |
38
+
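The counts above show how strongly the data is skewed toward the `O` class. As a rough illustration, the sketch below computes each label's share of the training tokens. The counts are copied from the table; the variable names are made up for this example and do not come from the training code.

```python
# Training-set token counts copied from the table above.
train_counts = {
    "O": 382_963,
    "B-protein": 30_269,
    "I-protein": 24_848,
    "B-cell_type": 6_718,
    "I-cell_type": 8_748,
    "B-DNA": 9_533,
    "I-DNA": 15_774,
    "B-cell_line": 3_830,
    "I-cell_line": 7_387,
    "B-RNA": 951,
    "I-RNA": 1_530,
}

total_tokens = sum(train_counts.values())

# Fraction of all training tokens carried by each label.
shares = {label: count / total_tokens for label, count in train_counts.items()}

# Print labels from most to least frequent.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<12} {share:6.2%}")
```

With these counts, `O` accounts for roughly 78% of the training tokens, which is one reason the span-level metrics are more informative than plain token accuracy.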
39
+ ### Model
40
+ An exhaustive hyperparameter search was done.
41
+ The hyperparameters that provided the best results are:
42
+
43
+ - Maximum sequence length: 128
44
+ - Number of epochs: 6
45
+ - Batch size: 32
46
+ - Dropout: 0.3
47
+ - Optimizer: Adam
48
+
49
+ The learning rate was 5e-5 with a linearly decreasing schedule. Warmup was applied at the beginning of training
50
+ over the first 10% of the total training steps (warmup ratio 0.1).
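The schedule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: it assumes one optimizer step per batch with the final partial batch kept, and the helper name `learning_rate` is hypothetical. The sample count, batch size, epoch count, peak learning rate and warmup ratio are the values given in this README.

```python
import math

# Values taken from this README.
NUM_SAMPLES = 18_546
BATCH_SIZE = 32
NUM_EPOCHS = 6
PEAK_LR = 5e-5
WARMUP_RATIO = 0.1

# Assumption: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
warmup_steps = int(WARMUP_RATIO * total_steps)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {learning_rate(step):.2e}")
```

Under these assumptions there are 580 steps per epoch and 3,480 steps in total, so the warmup covers the first 348 steps.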
51
+
52
+ The checkpoint from the epoch with the best F1-score was selected; in this case, epoch 5.
53
+
54
+
55
+ ### Evaluation
56
+ The following table shows the evaluation metrics calculated at span/entity level:
57
+
58
+ | | precision| recall| f1-score|
59
+ |:---------|-----------:|---------:|---------:|
60
+ | cell_line | 0.5205 | 0.7100 | 0.6007 |
61
+ | cell_type | 0.7736 | 0.7422 | 0.7576 |
62
+ | protein | 0.6953 | 0.8459 | 0.7633 |
63
+ | DNA | 0.6997 | 0.7894 | 0.7419 |
64
+ | RNA | 0.6985 | 0.8051 | 0.7480 |
65
+ | | | | |
66
+ | **micro avg** | 0.6984 | 0.8076 | 0.7490 |
67
+ | **macro avg** | 0.7032 | 0.8076 | 0.7498 |
68
+
69
+ The macro F1-score is 0.7498, below the 0.7728 reported by the Allen Institute for AI in their
70
+ [paper](https://arxiv.org/pdf/1903.10676.pdf). Several factors could explain this gap; one hypothesis is that
71
+ the authors used an additional conditional random field (CRF) layer,
72
+ while this model uses a regular classification layer with softmax activation on top of SciBERT.
73
+
74
+ At word level, this model achieves a precision of 0.7742, a recall of 0.8536 and an F1-score of 0.8093.
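As a sanity check on the span-level table, the F1-score of each entity class should be the harmonic mean of its precision and recall. The sketch below recomputes it from the rounded values in the table; `f1_score` is a helper written for this example, not a library call. The macro-averaged row is left out because an average of per-class F1-scores is generally not the harmonic mean of the averaged precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation table.
reported = {
    "cell_line": (0.5205, 0.7100, 0.6007),
    "cell_type": (0.7736, 0.7422, 0.7576),
    "protein": (0.6953, 0.8459, 0.7633),
    "DNA": (0.6997, 0.7894, 0.7419),
    "RNA": (0.6985, 0.8051, 0.7480),
    "micro avg": (0.6984, 0.8076, 0.7490),
}

for name, (p, r, f1) in reported.items():
    print(f"{name:<10} recomputed={f1_score(p, r):.4f} reported={f1:.4f}")
```

Small differences in the last digit are expected, since the precision and recall in the table are themselves rounded.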
75
+
76
+ ### Model usage in inference
77
+ Use the pipeline:
78
+ ````python
79
+ from transformers import pipeline
80
+
81
+ text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."
82
+
83
+ nlp_ner = pipeline("ner",
84
+ model='fran-martinez/scibert_scivocab_cased_ner_jnlpba',
85
+ tokenizer='fran-martinez/scibert_scivocab_cased_ner_jnlpba')
86
+
87
+ nlp_ner(text)
88
+
89
+ """
90
+ Output:
91
+ ---------------------------
92
+ [
93
+ {'word': 'glucocorticoid',
94
+ 'score': 0.9894881248474121,
95
+ 'entity': 'B-protein'},
96
+
97
+ {'word': 'receptor',
98
+ 'score': 0.989505410194397,
99
+ 'entity': 'I-protein'},
100
+
101
+ {'word': 'normal',
102
+ 'score': 0.7680378556251526,
103
+ 'entity': 'B-cell_type'},
104
+
105
+ {'word': 'cs',
106
+ 'score': 0.5176806449890137,
107
+ 'entity': 'I-cell_type'},
108
+
109
+ {'word': 'lymphocytes',
110
+ 'score': 0.9898491501808167,
111
+ 'entity': 'I-cell_type'}
112
+ ]
113
+ """
114
+ ````
115
+ Or load model and tokenizer as follows:
116
+ ````python
117
+ import torch
118
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
119
+
120
+ # Example
121
+ text = "Mouse thymus was used as a source of glucocorticoid receptor from normal CS lymphocytes."
122
+
123
+ # Load model
124
+ tokenizer = AutoTokenizer.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
125
+ model = AutoModelForTokenClassification.from_pretrained("fran-martinez/scibert_scivocab_cased_ner_jnlpba")
126
+
127
+ # Get input for BERT
128
+ input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
129
+
130
+ # Predict
131
+ with torch.no_grad():
132
+     outputs = model(input_ids)
133
+
134
+ # From the output let's take the first element of the tuple.
135
+ # Then, let's get rid of [CLS] and [SEP] tokens (first and last)
136
+ predictions = outputs[0].argmax(axis=-1)[0][1:-1]
137
+
138
+ # Map label class indexes to string labels.
139
+ for token, pred in zip(tokenizer.tokenize(text), predictions):
140
+     print(token, '->', model.config.id2label[pred.item()])
141
+
142
+ """
143
+ Output:
144
+ ---------------------------
145
+ mouse -> O
146
+ thymus -> O
147
+ was -> O
148
+ used -> O
149
+ as -> O
150
+ a -> O
151
+ source -> O
152
+ of -> O
153
+ glucocorticoid -> B-protein
154
+ receptor -> I-protein
155
+ from -> O
156
+ normal -> B-cell_type
157
+ cs -> I-cell_type
158
+ lymphocytes -> I-cell_type
159
+ . -> O
160
+ """
161
+ ````
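Both usage examples return one BIO tag per token. To obtain whole entities, consecutive `B-`/`I-` tags of the same type can be merged into spans. The following is a minimal sketch, not part of the model or of `transformers`: `group_entities` is a hypothetical helper, and it assumes one tag per whole word, as in the output shown above (word pieces starting with `##` would need to be merged first).

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, List[str]]] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # Start a new span on B-, or on an I- that does not continue the open span.
            spans.append((entity_type, [token]))
            current_type = entity_type
        elif prefix == "I":
            # Continue the span that is currently open.
            spans[-1][1].append(token)
        else:
            # "O" closes any open span.
            current_type = None
    return [(entity_type, " ".join(words)) for entity_type, words in spans]


# Tokens and tags copied from the example output above.
tokens = ["mouse", "thymus", "was", "used", "as", "a", "source", "of",
          "glucocorticoid", "receptor", "from", "normal", "cs", "lymphocytes", "."]
tags = ["O"] * 8 + ["B-protein", "I-protein", "O",
                    "B-cell_type", "I-cell_type", "I-cell_type", "O"]

print(group_entities(tokens, tags))
# [('protein', 'glucocorticoid receptor'), ('cell_type', 'normal cs lymphocytes')]
```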
config.json ADDED
@@ -0,0 +1,69 @@
1
+ {
2
+ "_num_labels": 11,
3
+ "architectures": [
4
+ "BertForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.3,
7
+ "bos_token_id": null,
8
+ "do_sample": false,
9
+ "early_stopping": false,
10
+ "eos_token_id": null,
11
+ "finetuning_task": null,
12
+ "hidden_act": "gelu",
13
+ "hidden_dropout_prob": 0.3,
14
+ "hidden_size": 768,
15
+ "id2label": {
16
+ "0": "I-cell_type",
17
+ "1": "B-DNA",
18
+ "10": "B-cell_type",
19
+ "2": "O",
20
+ "3": "I-cell_line",
21
+ "4": "I-protein",
22
+ "5": "I-RNA",
23
+ "6": "B-cell_line",
24
+ "7": "B-RNA",
25
+ "8": "I-DNA",
26
+ "9": "B-protein"
27
+ },
28
+ "initializer_range": 0.02,
29
+ "intermediate_size": 3072,
30
+ "is_decoder": false,
31
+ "is_encoder_decoder": false,
32
+ "label2id": {
33
+ "LABEL_0": 0,
34
+ "LABEL_1": 1,
35
+ "LABEL_10": 10,
36
+ "LABEL_2": 2,
37
+ "LABEL_3": 3,
38
+ "LABEL_4": 4,
39
+ "LABEL_5": 5,
40
+ "LABEL_6": 6,
41
+ "LABEL_7": 7,
42
+ "LABEL_8": 8,
43
+ "LABEL_9": 9
44
+ },
45
+ "layer_norm_eps": 1e-12,
46
+ "length_penalty": 1.0,
47
+ "max_length": 20,
48
+ "max_position_embeddings": 512,
49
+ "min_length": 0,
50
+ "model_type": "bert",
51
+ "no_repeat_ngram_size": 0,
52
+ "num_attention_heads": 12,
53
+ "num_beams": 1,
54
+ "num_hidden_layers": 12,
55
+ "num_return_sequences": 1,
56
+ "output_attentions": false,
57
+ "output_hidden_states": false,
58
+ "output_past": true,
59
+ "pad_token_id": 0,
60
+ "pruned_heads": {},
61
+ "repetition_penalty": 1.0,
62
+ "temperature": 1.0,
63
+ "top_k": 50,
64
+ "top_p": 1.0,
65
+ "torchscript": false,
66
+ "type_vocab_size": 2,
67
+ "use_bfloat16": false,
68
+ "vocab_size": 31090
69
+ }
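Note that `label2id` in this config still holds placeholder keys (`LABEL_0` to `LABEL_10`) rather than the class names listed in `id2label`. Code that needs a name-to-index lookup could derive one from `id2label` instead. A minimal sketch, with the mapping copied from the config above; the `label2id` built here is a local variable for illustration and is not written back to the file:

```python
# id2label as it appears in config.json (JSON object keys are strings).
id2label = {
    "0": "I-cell_type",
    "1": "B-DNA",
    "10": "B-cell_type",
    "2": "O",
    "3": "I-cell_line",
    "4": "I-protein",
    "5": "I-RNA",
    "6": "B-cell_line",
    "7": "B-RNA",
    "8": "I-DNA",
    "9": "B-protein",
}

# Invert the mapping, converting the string keys to integer class indexes.
label2id = {label: int(index) for index, label in id2label.items()}

for label, index in sorted(label2id.items(), key=lambda kv: kv[1]):
    print(index, label)
```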
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:69666bb5a436690197ee7e3ff010891140b85cc3dab7013a205df9555cce00ea
3
+ size 437352466
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f318c0c9452000f211edc4bc5b7eb0fea906e55544af8004d3ab09cea02924eb
3
+ size 439757565
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff