hiverlab-nicholastkb committed on
Commit · c5e9744
1 Parent(s): 1bb4a69
Public commit
- README.md +52 -12
- app.py +122 -0
- requirements.txt +9 -0
- trainedSB2/config.json +43 -0
- trainedSB2/pytorch_model.bin +3 -0
- trainedSB2/special_tokens_map.json +7 -0
- trainedSB2/tokenizer.json +0 -0
- trainedSB2/tokenizer_config.json +15 -0
- trainedSB2/training_args.bin +3 -0
- trainedSB2/vocab.txt +0 -0
README.md
CHANGED
@@ -1,12 +1,52 @@
+# LLMGeneLinker (LGL): a Fine-Tuned SciBERT Model for Named Entity Recognition
+
+LLMGeneLinker uses SciBERT, a domain-specific transformer, fine-tuned on the AllenAI Drug-Combo-Extraction dataset and the BC5CDR disease, NCBI disease, DrugProt, and GeneTAG datasets. The resulting model performs Named Entity Recognition (NER) to tag drugs, proteins, genes, and diseases in input text. Sentence embedding of SciBERT is then fed into BERT
+
+## Table of Contents
+
+- [Model Overview](#model-overview)
+- [Usage](#usage)
+- [Installation](#installation)
+
+- [Dataset](#dataset)
+
+- [Contributing](#contributing)
+- [License](#license)
+
+## Model Overview
+
+The model is based on [SciBERT](https://github.com/allenai/scibert), a BERT language model pre-trained on scientific publications, most of them biomedical. By fine-tuning SciBERT on labeled datasets, we created a specialized NER model that recognizes drugs, genes, and diseases in biomedical text.
+
+## Usage
+You can query the fine-tuned LGL model through an interactive web interface [here](spacelink). If you prefer to load the model yourself, see [Installation](#installation) below.
+
+## Installation
+To use LGL, install the required dependencies and download the model files:
+
+1. Clone this repository to your local machine.
+1.1 If you do not have Python installed, download Python from the official sources. Anaconda is recommended if you often use scientific packages.
+
+If using Anaconda, set up a new conda environment after installation (replace `myname` with your own environment name):
+```conda create --name myname python=3.8```
+
+2. Activate your venv or conda environment (if using one) and install the required Python packages with `pip`:
+
+```pip install -r requirements.txt```
+
+3. To use the fine-tuned NER model for recognizing drugs, genes, and diseases, start Jupyter Lab with `jupyter lab` and open `demo.ipynb`. The notebook takes text input as a string and returns the identified entities along with their labels.
+
+## Dataset
+
+The following datasets were processed and used for training and evaluation.
+Most datasets were sourced from `BigBIO`: [GitHub](https://github.com/bigscience-workshop/biomedical/blob/main/README.md), [HF](https://huggingface.co/bigbio)
+
+| Task Type | Dataset | Links |
+|:---------:|:---------:|:---------:|
+| NER | NCBI-disease | [Link](https://huggingface.co/datasets/bigbio/ncbi_disease) |
+| NER | BC5-disease | [Link](https://huggingface.co/datasets/bigbio/bc5cdr) |
+| NER | Genetag | [Link](https://huggingface.co/datasets/bigbio/genetag) |
+| NER/RE | Drugprot | [Link](https://huggingface.co/datasets/bigbio/drugprot) |
+| NER/RE | AllenAI Drug-Combo-Extraction | [Link](https://huggingface.co/datasets/allenai/drug-combo-extraction) |
+
+
+
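The README's step 3 describes a script that takes a string and returns entities with their labels. A minimal sketch of that flow follows, assuming the model files sit in `./trainedSB2` as in this commit and that the seven output ids follow the order of `label_list` in `app.py`. The names `to_label` and `tag_text` are hypothetical helpers, not part of the repository.

```python
# Assumed label order, copied from label_list in app.py.
LABEL_LIST = ["O", "B-DRUG", "I-DRUG", "B-DISEASE", "I-DISEASE", "B-GENE", "I-GENE"]


def to_label(entity: str) -> str:
    """Map a generic pipeline tag such as "LABEL_3" to its BIO label."""
    return LABEL_LIST[int(entity.split("_")[-1])]


def tag_text(text: str, model_dir: str = "./trainedSB2"):
    """Return (word, label, start, end) for every token not tagged "O"."""
    # Imported lazily so to_label works without transformers installed.
    from transformers import pipeline

    ner = pipeline(task="ner", model=model_dir, tokenizer=model_dir)
    results = []
    for token in ner(text):
        label = to_label(token["entity"])
        if label != "O":
            results.append((token["word"], label, token["start"], token["end"]))
    return results
```

`tag_text` rebuilds the pipeline on every call, which keeps the sketch short; a long-running app would build it once, as `app.py` does.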
app.py
ADDED
@@ -0,0 +1,122 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# In[1]:
+
+
+from datasets import Dataset, ClassLabel, Sequence, load_dataset
+import numpy as np
+import pandas as pd
+import bioc
+from spacy import displacy
+import transformers
+#import evaluate
+from transformers import (AutoModelForTokenClassification,
+                          AutoTokenizer,
+                          DataCollatorForTokenClassification,
+                          pipeline,
+                          TrainingArguments,
+                          Trainer)
+
+
+# In[2]:
+
+
+label_list = ['O', 'B-DRUG', 'I-DRUG', 'B-DISEASE', 'I-DISEASE', 'B-GENE', 'I-GENE']
+model_checkpoint = './trainedSB2'
+tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
+effect_ner_model = pipeline(task="ner", model=model, tokenizer=tokenizer)
+
+
+# In[21]:
+
+
+def visualize_entities(sentence):
+    tokens = effect_ner_model(sentence)
+    entities = []
+    # The saved config names the labels "LABEL_0" ... "LABEL_6"; the numeric
+    # suffix indexes into label_list.
+    for token in tokens:
+        label = int(token["entity"].split("_")[-1])
+        if label != 0:
+            token["label"] = label_list[label]
+            entities.append(token)
+
+    params = [{"text": sentence,
+               "ents": entities,
+               "title": None}]
+
+    html = displacy.render(params, style="ent", manual=True, options={
+        "colors": {
+
+            "B-DRUG": "#f08080",
+            "I-DRUG": "#f08080",
+            "B-DISEASE": "#9bddff",
+            "I-DISEASE": "#9bddff",
+            "B-GENE": "#008080",
+            "I-GENE": "#008080",
+        },
+    })
+    return html
+
+
+
+# In[25]:
+
+
+import gradio as gr
+
+exampleList = [
+    'Famotidine is a histamine H2-receptor antagonist used in inpatient settings for prevention of stress ulcers and is showing increasing popularity because of its low cost.',
+    'A randomized Phase III trial demonstrated noninferiority of APF530 500 mg SC ( granisetron 10 mg ) to intravenous palonosetron 0.25 mg in preventing CINV in patients receiving MEC or HEC in acute ( 0 - 24 hours ) and delayed ( 24 - 120 hours ) settings , with activity over 120 hours .',
+    'What are the known interactions between Aspirin and the COX-1 enzyme?',
+    'Can you explain the mechanism of action of Metformin and its effect on the AMPK pathway?',
+    'Are there any genetic variations in the CYP2C9 gene that may influence the response to Warfarin therapy?',
+    'I am curious about the role of Herceptin in targeting the HER2/neu protein in breast cancer treatment. How does it work?',
+    'What are the common side effects associated with Lisinopril, an angiotensin-converting enzyme (ACE) inhibitor?',
+    'Can you explain the significance of the BCR-ABL fusion protein in the context of Imatinib therapy for chronic myeloid leukemia (CML)?',
+    'How does Ibuprofen affect the COX-2 enzyme compared to COX-1?',
+    'Are there any recent studies exploring the use of Pembrolizumab as an immune checkpoint inhibitor targeting PD-1?',
+    'I have heard about the SLC6A4 gene and its association with serotonin reuptake inhibitors (SSRIs) like Fluoxetine.',
+    'Could you provide insights into the BRAF mutation and its relevance in response to Vemurafenib treatment in melanoma patients?'
+]
+
+footer = """
+LLMGeneLinker uses SciBERT, a domain-specific transformer, fine-tuned on the AllenAI Drug-Combo-Extraction dataset and the BC5CDR disease, NCBI disease, DrugProt, and GeneTAG datasets. The resulting model performs Named Entity Recognition (NER) to tag drugs, proteins, genes, and diseases in input text. Sentence embedding of SciBERT is then fed into BERT
+This was made during the <a target="_blank" href="https://www.sginnovate.com/event/hackathon-large-language-models-bio">LLMs for Bio Hackathon</a> organised by 4Catalyzer and SGInnovate.
+<br>
+Made by Team GeneLink (<a target="_blank" href="https://www.linkedin.com/in/ntkb/">Nicholas</a>, <a target="_blank" href="https://www.linkedin.com/in/yewchong-sim/">Yew Chong</a>, <a target="_blank" href="https://www.linkedin.com/in/lim-ting-wei-021383175/">Ting Wei</a>, <a target="_blank" href="https://www.linkedin.com/in/brendan-lim-ciwen/">Brendan</a>)
+<hr>
+Note: Performance is poorer on genes, acronyms, and receptors (named entities that may be targets for drugs or genes).
+Original notebook adapted from <a target="_blank" href="https://huggingface.co/jsylee/scibert_scivocab_uncased-finetuned-ner">jsylee/scibert_scivocab_uncased-finetuned-ner</a>
+"""
+
+with gr.Blocks() as demo:
+    gr.Markdown("## LLMGeneLinker (LGL)")
+    gr.Markdown(footer)
+
+    txt = gr.Textbox(label="Input", lines=2)
+    txt_3 = gr.HTML(label="Output")
+    btn = gr.Button(value="Submit")
+    btn.click(visualize_entities, inputs=txt, outputs=txt_3)
+
+    gr.Markdown("## Text Examples")
+    gr.Examples(
+        [[x] for x in exampleList],
+        txt,
+        txt_3,
+        visualize_entities,
+        cache_examples=False,
+        run_on_click=True
+    )
+
+
+if __name__ == "__main__":
+    demo.launch()
+
+
+# In[ ]:
+
+
+
+
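`visualize_entities` hands every tagged wordpiece to displaCy as its own span, so a name that the tokenizer splits into several pieces is highlighted piece by piece. One possible refinement, sketched under the assumption that each token carries the `start`, `end`, and `label` keys used in `app.py` and that "O" tokens were already filtered out, merges adjacent pieces of the same entity type. `merge_spans` is a hypothetical helper, not part of `app.py`.

```python
def merge_spans(tokens, max_gap=1):
    """Merge adjacent tagged tokens of the same entity type into single spans.

    Each token is a dict with "start" and "end" character offsets and a BIO
    "label" such as "B-DRUG". A token extends the previous span when it has
    the same entity type and either is an "I-" tag at most `max_gap`
    characters after the span, or starts exactly where the span ends.
    """
    spans = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        prefix, _, kind = tok["label"].partition("-")
        prev = spans[-1] if spans else None
        touching = prev is not None and tok["start"] - prev["end"] <= max_gap
        # Wordpieces of one word are contiguous (gap 0); a "B-" tag after a
        # space opens a new entity even when the type matches.
        continues = touching and prev["label"] == kind and (
            prefix == "I" or tok["start"] == prev["end"]
        )
        if continues:
            prev["end"] = tok["end"]
        else:
            spans.append({"start": tok["start"], "end": tok["end"], "label": kind})
    return spans
```

The merged spans use the bare entity type as their label, so the displaCy `colors` option would then be keyed by `DRUG`, `DISEASE`, and `GENE` rather than by the BIO tags.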
requirements.txt
ADDED
@@ -0,0 +1,9 @@
+huggingface_hub
+jupyterlab
+datasets
+transformers[torch]
+seqeval
+spacy
+evaluate
+bioc
+gradio
trainedSB2/config.json
ADDED
@@ -0,0 +1,43 @@
+{
+  "_name_or_path": "allenai/scibert_scivocab_uncased",
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1",
+    "2": "LABEL_2",
+    "3": "LABEL_3",
+    "4": "LABEL_4",
+    "5": "LABEL_5",
+    "6": "LABEL_6"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1,
+    "LABEL_2": 2,
+    "LABEL_3": 3,
+    "LABEL_4": 4,
+    "LABEL_5": 5,
+    "LABEL_6": 6
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.31.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 31090
+}
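The saved `id2label` and `label2id` maps use the generic `LABEL_n` names, which is why `app.py` has to parse the numeric suffix of each tag. A sketch of how the maps could carry the real tag names instead, assuming the id order matches `label_list` in `app.py` (the config itself cannot confirm this). `with_label_names` is a hypothetical helper.

```python
import json

# Assumed label order, copied from label_list in app.py.
LABEL_LIST = ["O", "B-DRUG", "I-DRUG", "B-DISEASE", "I-DISEASE", "B-GENE", "I-GENE"]


def with_label_names(config: dict, labels=LABEL_LIST) -> dict:
    """Return a copy of a model config dict with named id2label/label2id maps."""
    updated = dict(config)
    # JSON object keys are strings, so ids are stored as "0", "1", ...
    updated["id2label"] = {str(i): name for i, name in enumerate(labels)}
    updated["label2id"] = {name: i for i, name in enumerate(labels)}
    return updated


if __name__ == "__main__":
    generic = {"model_type": "bert", "id2label": {"0": "LABEL_0"}, "label2id": {"LABEL_0": 0}}
    print(json.dumps(with_label_names(generic), indent=2))
```

Because the token-classification pipeline takes its tag names from the model config's `id2label`, named entries there would let the `entity` field report `B-DRUG` and similar tags directly.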
trainedSB2/pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b09b5b25e4643e98d3404b02ad8e83801ef9c7025d1904e59d2f8d76e65eb883
+size 437400745
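The three lines added for `pytorch_model.bin` are a Git LFS pointer, not the weights themselves: `oid` is the SHA-256 digest of the real file and `size` is its length in bytes. A small sketch for checking a downloaded file against such a pointer follows; `parse_pointer` and `matches_pointer` are hypothetical helper names.

```python
import hashlib
import os


def parse_pointer(text: str) -> dict:
    """Parse the "key value" lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        # The oid field looks like "sha256:<hex digest>".
        "sha256": fields["oid"].split(":", 1)[1],
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large weight files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["sha256"]
```

Comparing the size first is a cheap way to reject a file that is still the pointer text itself, for example after cloning without Git LFS installed.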
trainedSB2/special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
trainedSB2/tokenizer.json
ADDED
The diff for this file is too large to render.
trainedSB2/tokenizer_config.json
ADDED
@@ -0,0 +1,15 @@
+{
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
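The `model_max_length` value here is the placeholder that `transformers` writes when no length limit is known, so the tokenizer will not truncate on its own, while `config.json` caps the model at 512 positions. Longer inputs therefore need to be truncated or split before inference. A sketch of fixed-size splitting over a list of token ids, assuming two positions are reserved for `[CLS]` and `[SEP]`; `split_ids` is a hypothetical helper and the default overlap is an arbitrary choice.

```python
def split_ids(token_ids, max_positions=512, reserved=2, overlap=32):
    """Split token ids into overlapping windows that fit the model.

    `reserved` leaves room for the [CLS] and [SEP] tokens added around each
    window; `overlap` repeats a few ids so that a short entity sitting on a
    window boundary is seen whole in at least one window.
    """
    window = max_positions - reserved
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        # Stop once a window reaches the end of the input.
        if start + window >= len(token_ids):
            break
    return chunks
```

Predictions from overlapping windows still have to be reconciled afterwards, for example by keeping the tag from whichever window places the token farther from its edge; that step is left out of the sketch.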
trainedSB2/training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1a97024dafc1a94923419f85c63ab33f440e46391ec27dfec53864654217b0a3
+size 4027
trainedSB2/vocab.txt
ADDED
The diff for this file is too large to render.