hiverlab-nicholastkb committed on
Commit
c5e9744
·
1 Parent(s): 1bb4a69

Public commit

README.md CHANGED
@@ -1,12 +1,52 @@
- ---
- title: LLMGeneLinker LGL V1
- emoji: 👀
- colorFrom: indigo
- colorTo: pink
- sdk: gradio
- sdk_version: 3.39.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # LLMGeneLinker (LGL): a Fine-Tuned SciBERT Model for Named Entity Recognition
+
+ LLMGeneLinker uses SciBERT, a domain-specific transformer, fine-tuned on the AllenAI drug dataset and the BC5CDR disease, NCBI disease, DrugProt, and GeneTAG datasets. The resulting model performs Named Entity Recognition to tag drugs, proteins, genes, and diseases in input text. The sentence embedding from SciBERT is then fed into BERT.
+
+ ## Table of Contents
+
+ - [Model Overview](#model-overview)
+ - [Usage](#usage)
+ - [Installation](#installation)
+ - [Dataset](#dataset)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+ ## Model Overview
+
+ The model is based on the [SciBERT](https://github.com/allenai/scibert) architecture, a language model pre-trained on scientific text, most of it from the biomedical domain. By fine-tuning SciBERT on labeled datasets, we created a specialized NER model that recognizes drugs, genes, and diseases in biomedical texts.
+
+ ## Usage
+ You can access an interactive web interface for querying the fine-tuned LGL model [here](spacelink). If you prefer to load the model yourself, see [Installation](#installation) below.
+
+ ## Installation
+ To use LGL, install the required dependencies and download the model files. Follow the steps below to set up the environment:
+
+ 1. Clone this repository to your local machine.
+    If you do not have Python installed, download it from the official sources. Anaconda is recommended if you use scientific packages often.
+
+    If using Anaconda, set up a new conda environment after installation (replace `myname` with your own choice of environment name):
+    ```conda create --name myname python=3.8```
+
+ 2. Activate your venv or conda environment (if using one) and install the required Python packages using `pip`:
+
+    ```pip install -r requirements.txt```
+
+ 3. To use the fine-tuned NER model to recognize drugs, genes, and diseases, start Jupyter Lab with ```jupyter lab``` and open `demo.ipynb`. The notebook takes text input as a string and returns the identified entities along with their respective labels.
+
+ ## Dataset
+
+ The following datasets were processed and used for training and evaluation.
+ Most datasets were sourced from `BigBIO`: [GitHub](https://github.com/bigscience-workshop/biomedical/blob/main/README.md), [HF](https://huggingface.co/bigbio)
+
+ | Task Type | Dataset | Link |
+ |:---------:|:---------:|:---------:|
+ | NER | NCBI-disease | [Link](https://huggingface.co/datasets/bigbio/ncbi_disease) |
+ | NER | BC5-disease | [Link](https://huggingface.co/datasets/bigbio/bc5cdr) |
+ | NER | Genetag | [Link](https://huggingface.co/datasets/bigbio/genetag) |
+ | NER/RE | Drugprot | [Link](https://huggingface.co/datasets/bigbio/drugprot) |
+ | NER/RE | AllenAI Drug-Combo-Extraction | [Link](https://huggingface.co/datasets/allenai/drug-combo-extraction) |
+
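The README says the notebook returns the identified entities with their labels. Because the saved config only carries generic `LABEL_<id>` names, the raw model output has to be mapped back to the tag names. The following is a minimal sketch of that mapping in plain Python. The function name `decode_entities` and the sample token dicts are hypothetical, written by hand to mimic the shape of token-classification output, and are not real model predictions.

```python
# Tag names in the same order as label_list in app.py.
LABEL_LIST = ["O", "B-DRUG", "I-DRUG", "B-DISEASE", "I-DISEASE", "B-GENE", "I-GENE"]


def decode_entities(tokens, label_list=LABEL_LIST):
    """Map 'LABEL_<id>' strings to tag names and drop tokens tagged 'O'."""
    decoded = []
    for token in tokens:
        # Parse everything after the last underscore, so "LABEL_10" would also work.
        label_id = int(token["entity"].rsplit("_", 1)[-1])
        if label_id == 0:
            continue  # id 0 is the 'O' (outside) tag
        decoded.append({
            "word": token["word"],
            "label": label_list[label_id],
            "start": token["start"],
            "end": token["end"],
        })
    return decoded


# Hand-written stand-in for pipeline output (assumption, not model predictions).
sample_tokens = [
    {"entity": "LABEL_1", "word": "aspirin", "start": 0, "end": 7},
    {"entity": "LABEL_0", "word": "inhibits", "start": 8, "end": 16},
    {"entity": "LABEL_5", "word": "cox", "start": 17, "end": 20},
]
decoded = decode_entities(sample_tokens)
for entity in decoded:
    print(entity["word"], entity["label"])
```

Parsing the id after the underscore, rather than reading the last character, keeps the mapping valid if the label set ever grows past ten entries.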
app.py ADDED
@@ -0,0 +1,122 @@
+ #!/usr/bin/env python
+ # coding: utf-8
+
+ # In[1]:
+
+ # load_metric is not imported: it is unused here and was removed in datasets 3.0.
+ from datasets import Dataset, ClassLabel, Sequence, load_dataset
+ import numpy as np
+ import pandas as pd
+ import bioc
+ from spacy import displacy
+ import transformers
+ #import evaluate
+ from transformers import (AutoModelForTokenClassification,
+                           AutoTokenizer,
+                           DataCollatorForTokenClassification,
+                           pipeline,
+                           TrainingArguments,
+                           Trainer)
+
+
+ # In[2]:
+
+
+ label_list = ['O', 'B-DRUG', 'I-DRUG', 'B-DISEASE', 'I-DISEASE', 'B-GENE', 'I-GENE']
+ model_checkpoint = './trainedSB2'
+ tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+ model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
+ effect_ner_model = pipeline(task="ner", model=model, tokenizer=tokenizer)
+
+
+ # In[21]:
+
+
+ def visualize_entities(sentence):
+     # The pipeline returns one dict per token, with "entity" set to "LABEL_<id>".
+     tokens = effect_ner_model(sentence)
+     entities = []
+     # ['O', 'B-DRUG', 'I-DRUG', 'B-DISEASE', 'I-DISEASE', 'B-GENE', 'I-GENE']
+     for token in tokens:
+         # Parse the numeric id after the underscore, e.g. "LABEL_3" -> 3.
+         label = int(token["entity"].split("_")[-1])
+         if label != 0:
+             # displaCy's manual mode reads "start", "end", and "label" from each entity.
+             token["label"] = label_list[label]
+             entities.append(token)
+
+     params = [{"text": sentence,
+                "ents": entities,
+                "title": None}]
+
+     html = displacy.render(params, style="ent", manual=True, options={
+         "colors": {
+             "B-DRUG": "#f08080",
+             "I-DRUG": "#f08080",
+             "B-DISEASE": "#9bddff",
+             "I-DISEASE": "#9bddff",
+             "B-GENE": "#008080",
+             "I-GENE": "#008080",
+         },
+     })
+     return html
+
+
+ # In[25]:
+
+
+ import gradio as gr
+
+ exampleList = [
+     'Famotidine is a histamine H2-receptor antagonist used in inpatient settings for prevention of stress ulcers and is showing increasing popularity because of its low cost.',
+     'A randomized Phase III trial demonstrated noninferiority of APF530 500 mg SC ( granisetron 10 mg ) to intravenous palonosetron 0.25 mg in preventing CINV in patients receiving MEC or HEC in acute ( 0 - 24 hours ) and delayed ( 24 - 120 hours ) settings , with activity over 120 hours .',
+     'What are the known interactions between Aspirin and the COX-1 enzyme?',
+     'Can you explain the mechanism of action of Metformin and its effect on the AMPK pathway?',
+     'Are there any genetic variations in the CYP2C9 gene that may influence the response to Warfarin therapy?',
+     'I am curious about the role of Herceptin in targeting the HER2/neu protein in breast cancer treatment. How does it work?',
+     'What are the common side effects associated with Lisinopril, an angiotensin-converting enzyme (ACE) inhibitor?',
+     'Can you explain the significance of the BCR-ABL fusion protein in the context of Imatinib therapy for chronic myeloid leukemia (CML)?',
+     'How does Ibuprofen affect the COX-2 enzyme compared to COX-1?',
+     'Are there any recent studies exploring the use of Pembrolizumab as an immune checkpoint inhibitor targeting PD-1?',
+     'I have heard about the SLC6A4 gene and its association with serotonin reuptake inhibitors (SSRIs) like Fluoxetine.',
+     'Could you provide insights into the BRAF mutation and its relevance in response to Vemurafenib treatment in melanoma patients?'
+ ]
+
+ footer = """
+ LLMGeneLinker uses SciBERT, a domain-specific transformer, fine-tuned on the AllenAI drug dataset and the BC5CDR disease, NCBI disease, DrugProt, and GeneTAG datasets. The resulting model performs Named Entity Recognition to tag drugs, proteins, genes, and diseases in input text. The sentence embedding from SciBERT is then fed into BERT.
+ This was made during the <a target="_blank" href="https://www.sginnovate.com/event/hackathon-large-language-models-bio">LLMs for Bio Hackathon</a> organised by 4Catalyzer and SGInnovate.
+ <br>
+ Made by Team GeneLink (<a target="_blank" href="https://www.linkedin.com/in/ntkb/">Nicholas</a>, <a target="_blank" href="https://www.linkedin.com/in/yewchong-sim/">Yew Chong</a>, <a target="_blank" href="https://www.linkedin.com/in/lim-ting-wei-021383175/">Ting Wei</a>, <a target="_blank" href="https://www.linkedin.com/in/brendan-lim-ciwen/">Brendan</a>)
+ <hr>
+ Note: Performance is poorer on genes, acronyms, and receptors (named entities that may be targets for drugs or genes).
+ Original notebook adapted from <a target="_blank" href="https://huggingface.co/jsylee/scibert_scivocab_uncased-finetuned-ner">jsylee/scibert_scivocab_uncased-finetuned-ner</a>
+ """
+
+ with gr.Blocks() as demo:
+     gr.Markdown("## LLMGeneLinker (LGL)")
+     gr.Markdown(footer)
+
+     txt = gr.Textbox(label="Input", lines=2)
+     txt_3 = gr.HTML(label="Output")
+     btn = gr.Button(value="Submit")
+     btn.click(visualize_entities, inputs=txt, outputs=txt_3)
+
+     gr.Markdown("## Text Examples")
+     gr.Examples(
+         [[x] for x in exampleList],
+         txt,
+         txt_3,
+         visualize_entities,
+         cache_examples=False,
+         run_on_click=True
+     )
+
+
+ if __name__ == "__main__":
+     demo.launch()
+
+
+ # In[ ]:
+
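`visualize_entities` passes every tagged token to displaCy as its own entity, so a multi-word or multi-piece entity is drawn as several adjacent highlights labelled `B-…` and `I-…`. One possible refinement, sketched here under assumptions, is to merge adjacent `B-`/`I-` tokens of the same type into a single span first. The helper name `merge_bio_spans`, the `max_gap` parameter, and the hand-written `tagged` list are hypothetical and not part of the commit.

```python
def merge_bio_spans(tokens, max_gap=1):
    """Merge adjacent B-/I- tokens of one entity type into single spans.

    Each token needs "label" (e.g. "B-DRUG"), "start", and "end" keys,
    the same keys visualize_entities hands to displaCy.
    """
    spans = []
    for token in sorted(tokens, key=lambda t: t["start"]):
        prefix, _, entity_type = token["label"].partition("-")
        last = spans[-1] if spans else None
        # An I- token extends the previous span only if the type matches and the
        # token starts right after it (gap 0 for word pieces, 1 for a space).
        continues = (
            prefix == "I"
            and last is not None
            and last["label"] == entity_type
            and token["start"] - last["end"] <= max_gap
        )
        if continues:
            last["end"] = token["end"]
        else:
            spans.append({"start": token["start"], "end": token["end"], "label": entity_type})
    return spans


sentence = "Imatinib treats chronic myeloid leukemia"
# Hand-written tags for illustration (assumption, not model output).
tagged = [
    {"label": "B-DRUG", "start": 0, "end": 8},
    {"label": "B-DISEASE", "start": 16, "end": 23},
    {"label": "I-DISEASE", "start": 24, "end": 31},
    {"label": "I-DISEASE", "start": 32, "end": 40},
]
spans = merge_bio_spans(tagged)
for span in spans:
    print(span["label"], sentence[span["start"]:span["end"]])
```

If spans were merged this way, the `colors` option passed to `displacy.render` would need keys for the bare types (`DRUG`, `DISEASE`, `GENE`) instead of the `B-`/`I-` variants.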
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ huggingface_hub
+ jupyterlab
+ datasets
+ transformers[torch]
+ seqeval
+ spacy
+ evaluate
+ bioc
+ gradio
trainedSB2/config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "_name_or_path": "allenai/scibert_scivocab_uncased",
+   "architectures": [
+     "BertForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2",
+     "3": "LABEL_3",
+     "4": "LABEL_4",
+     "5": "LABEL_5",
+     "6": "LABEL_6"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2,
+     "LABEL_3": 3,
+     "LABEL_4": 4,
+     "LABEL_5": 5,
+     "LABEL_6": 6
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.31.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 31090
+ }
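The `id2label` and `label2id` entries above hold only generic `LABEL_<id>` names, which is why `app.py` has to map ids back to tags by hand. A small sketch of how the two mappings could be built from the tag list follows. It assumes the tag order matches `label_list` in `app.py`; `id2label` and `label2id` are standard `transformers` config fields and can be passed as keyword arguments to `from_pretrained` when the model is created for fine-tuning, but nothing in this commit shows how the checkpoint was actually saved.

```python
import json

# Assumed to match label_list in app.py, in the same order.
LABEL_LIST = ["O", "B-DRUG", "I-DRUG", "B-DISEASE", "I-DISEASE", "B-GENE", "I-GENE"]

# id -> tag name and tag name -> id, the two directions the config stores.
id2label = dict(enumerate(LABEL_LIST))
label2id = {label: idx for idx, label in id2label.items()}

# JSON object keys are always strings, so the integer ids serialize as "0", "1", ...
config_fragment = json.dumps({"id2label": id2label, "label2id": label2id}, indent=2)
print(config_fragment)
```

With named labels stored in the config, token-classification output would carry tags such as `B-DRUG` directly instead of `LABEL_1`.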
trainedSB2/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b09b5b25e4643e98d3404b02ad8e83801ef9c7025d1904e59d2f8d76e65eb883
+ size 437400745
trainedSB2/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
trainedSB2/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
trainedSB2/tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
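The `model_max_length` value above is the very large placeholder that `transformers` writes when no limit is set, while `config.json` fixes `max_position_embeddings` at 512. Inputs longer than 512 tokens are therefore likely to need truncation or chunking before they reach the model. The following is a minimal sketch of the chunking option on plain token-id lists. The function name `chunk_token_ids` and its parameters are hypothetical, and it assumes two positions per window are reserved for `[CLS]` and `[SEP]`.

```python
def chunk_token_ids(token_ids, max_positions=512, num_special=2, overlap=0):
    """Split token ids into windows that fit the model's position limit.

    max_positions mirrors max_position_embeddings in config.json;
    num_special reserves room for [CLS] and [SEP] in each window.
    """
    window = max_positions - num_special
    if window <= 0:
        raise ValueError("max_positions must exceed num_special")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks


# Stand-in ids for a long document (assumption: 1200 tokens).
ids = list(range(1200))
chunks = chunk_token_ids(ids)
print([len(chunk) for chunk in chunks])
```

A simpler alternative, if truncation is acceptable, is to pass `model_max_length=512` when loading the tokenizer so that truncation has a concrete limit to apply.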
trainedSB2/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1a97024dafc1a94923419f85c63ab33f440e46391ec27dfec53864654217b0a3
+ size 4027
trainedSB2/vocab.txt ADDED
The diff for this file is too large to render. See raw diff