---
title: LLMGeneLinker (LGL)
language: en
sdk: gradio
tags:
- Named Entity Recognition
- SciBERT
- Drug-Target interaction
- Drugs
- Genes
- Proteins
- Medical
datasets:
- bigbio/ncbi_disease
- bigbio/bc5cdr
- bigbio/genetag
- bigbio/drugprot
- allenai/drug-combo-extraction
---
# LLMGeneLinker (LGL): a Fine-Tuned SciBERT Model for Named Entity Recognition
LLMGeneLinker uses SciBERT, a domain-specific transformer, fine-tuned on the AllenAI Drug-Combo-Extraction, BC5CDR disease, NCBI disease, DrugProt, and GeneTAG datasets. The resulting model performs Named Entity Recognition (NER) to tag drugs, proteins, genes, and diseases in input text. The sentence embedding from SciBERT is then fed into BERT.
## Table of Contents
- [Model Overview](#model-overview)
- [Usage](#usage)
- [Installation](#installation)
- [Dataset](#dataset)
- [Contributing](#contributing)
- [License](#license)
## Model Overview
The model is based on [SciBERT](https://github.com/allenai/scibert), a BERT-style language model pre-trained on scientific text, most of it biomedical. Fine-tuning SciBERT on the labeled datasets listed under [Dataset](#dataset) yields a specialized NER model that recognizes drugs, genes, and diseases in biomedical text.
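
To make the tagging step concrete, here is a minimal sketch of how token-level predictions are commonly grouped into entities. It assumes a BIO tagging scheme and the label names `DRUG`, `GENE`, and `DISEASE`; both are illustrative assumptions, since this README does not state the label set the model actually uses.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group token-level BIO tags into (entity_text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            entities.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


# Hypothetical model output for one sentence (labels are assumptions).
tokens = ["Imatinib", "inhibits", "BCR", "-", "ABL", "in",
          "chronic", "myeloid", "leukemia", "."]
tags = ["B-DRUG", "O", "B-GENE", "I-GENE", "I-GENE", "O",
        "B-DISEASE", "I-DISEASE", "I-DISEASE", "O"]
print(decode_bio(tokens, tags))
```

Tokens are joined with single spaces here, so the recovered text may differ from the source string around punctuation; mapping back to character offsets avoids that.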
## Usage
You can query the fine-tuned LGL model through an interactive web interface [here](spacelink). If you prefer to load the model yourself, see [Installation](#installation) below.
## Installation
If you prefer to run LGL locally or fine-tune it further, install the required dependencies and download the model files. Follow the steps below to set up the environment:
1. Clone this repository to your local machine.
1.1 If you do not have Python installed, download it from the official sources. Anaconda is recommended if you often use scientific packages.
If using Anaconda, set up a new conda environment after installation (replace `myname` with an environment name of your choice):
```conda create --name myname python=3.8```
2. Activate your venv or conda environment (if using one) and install the required Python packages with `pip`:
```pip install -r requirements_local.txt```
3. To use the fine-tuned NER model to recognize drugs, genes, and diseases, start Jupyter Lab with ```jupyter lab``` and open `demo.ipynb`. The notebook takes text input as a string and returns the identified entities along with their labels.
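
Outside the notebook, the same string-in, entities-out flow could look roughly like the sketch below. It is not the code in `demo.ipynb`: the helper names `load_ner` and `tag_text`, the `min_score` threshold, and the label names are hypothetical, and the checkpoint path is left as a parameter because this README does not name it. The only library call used is the Hugging Face `transformers` token-classification pipeline, assumed here to be a suitable way to load the fine-tuned model.

```python
from typing import Callable, Dict, List, Tuple


def load_ner(model_path: str):
    """Build a token-classification pipeline from a checkpoint.

    `model_path` is a placeholder for wherever the fine-tuned model lives.
    """
    # Imported lazily so tag_text() can be used without transformers installed.
    from transformers import pipeline
    return pipeline("token-classification", model=model_path,
                    aggregation_strategy="simple")


def tag_text(ner: Callable[[str], List[Dict]], text: str,
             min_score: float = 0.5) -> List[Tuple[str, str]]:
    """Run `ner` on `text`; return (entity_text, label) pairs above `min_score`."""
    results = []
    for pred in ner(text):
        if pred["score"] >= min_score:
            # Slice the original string so spacing and casing are preserved.
            results.append((text[pred["start"]:pred["end"]], pred["entity_group"]))
    return results


def fake_ner(text: str) -> List[Dict]:
    """Stand-in that mimics the pipeline's output format (values are made up)."""
    drug, disease, noise = "Metformin", "type 2 diabetes", "glucose"
    return [
        {"entity_group": "DRUG", "score": 0.98,
         "start": text.index(drug), "end": text.index(drug) + len(drug)},
        {"entity_group": "DRUG", "score": 0.20,
         "start": text.index(noise), "end": text.index(noise) + len(noise)},
        {"entity_group": "DISEASE", "score": 0.95,
         "start": text.index(disease), "end": text.index(disease) + len(disease)},
    ]


sentence = "Metformin lowers glucose in type 2 diabetes."
print(tag_text(fake_ner, sentence))
```

With a real checkpoint, `tag_text(load_ner(path), sentence)` would replace the stand-in call; the low-scoring `glucose` prediction shows how the threshold filters uncertain spans.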
## Dataset
The following datasets were processed and used for training and evaluation.
Most datasets were sourced from `BigBIO`: [GitHub](https://github.com/bigscience-workshop/biomedical/blob/main/README.md), [HF](https://huggingface.co/bigbio)

| Task Type | Dataset | Link |
|:---------:|:---------:|:---------:|
| NER | NCBI-disease | [Link](https://huggingface.co/datasets/bigbio/ncbi_disease) |
| NER | BC5CDR-disease | [Link](https://huggingface.co/datasets/bigbio/bc5cdr) |
| NER | GeneTAG | [Link](https://huggingface.co/datasets/bigbio/genetag) |
| NER/RE | DrugProt | [Link](https://huggingface.co/datasets/bigbio/drugprot) |
| NER/RE | AllenAI Drug-Combo-Extraction | [Link](https://huggingface.co/datasets/allenai/drug-combo-extraction) |
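
The preprocessing applied to these corpora is not described here. As an illustration only, the sketch below shows one common step for NER data that stores entities as character offsets: converting the spans into token-level BIO tags. The function name `spans_to_bio`, the regex tokenizer, and the label names are assumptions, not the project's actual pipeline.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def spans_to_bio(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Tokenize `text` and assign a BIO tag to each token.

    Assumes spans do not overlap and start on token boundaries.
    """
    tokens, tags = [], []
    # Words and single punctuation marks become separate tokens.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            if tok_start >= start and tok_end <= end:
                # First token of the span gets B-, the rest get I-.
                tag = ("B-" if tok_start == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


example = "Mutations in BRCA1 cause breast cancer."
gene, disease = "BRCA1", "breast cancer"
example_spans = [
    (example.index(gene), example.index(gene) + len(gene), "GENE"),
    (example.index(disease), example.index(disease) + len(disease), "DISEASE"),
]
for token, tag in zip(*spans_to_bio(example, example_spans)):
    print(f"{token}\t{tag}")
```

A subword tokenizer such as SciBERT's would need an extra alignment step, since one word can split into several pieces that share a tag.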