---
library_name: peft
base_model: mistralai/Mistral-7B-v0.1
license: apache-2.0
language:
- en
---

# PathoIE-Mistral-7B

<img src="https://cdn-uploads.huggingface.co/production/uploads/646704281dd5854d4de2cdda/Rj_swAuOXmAUNWztt6fpp.webp" width="500" />

## Training

Check out our GitHub repository: https://github.com/HIRC-SNUBH/Curation_LLM_PathoReport.git

Framework version:

- PEFT 0.4.0

## Inference

The model was trained on instructions that follow the ChatML template, so the tokenizer must be extended with the ChatML tokens (`<|im_start|>` and `<|im_end|>`) before inference.

``` python
import torch
from transformers import AddedToken, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    'mistralai/Mistral-7B-v0.1',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Optional; lower the precision if you have insufficient VRAM.
)

# Load tokenizer and register the ChatML special tokens
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')
tokenizer.add_special_tokens(dict(
    eos_token=AddedToken("<|im_end|>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    unk_token=AddedToken("<unk>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    bos_token=AddedToken("<s>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    pad_token=AddedToken("</s>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True),
))
tokenizer.add_tokens([AddedToken("<|im_start|>", single_word=False, lstrip=True, rstrip=True, normalized=False)], special_tokens=True)
tokenizer.additional_special_tokens = ['<unk>', '<s>', '</s>', '<|im_end|>', '<|im_start|>']

# Resize the base model's embeddings to the extended vocabulary before attaching the adapter
base_model.resize_token_embeddings(len(tokenizer))
base_model.config.eos_token_id = tokenizer.eos_token_id

# Load PEFT adapter and merge it into the base weights
model = PeftModel.from_pretrained(base_model, 'Lowenzahn/PathoIE-Mistral-7B')
model = model.merge_and_unload()
model = model.eval()

# Inference
prompts = ["Machine learning is"]
inputs = tokenizer(prompts, return_tensors="pt").to(model.device)  # keep inputs on the model's device
# do_sample=False selects greedy decoding, so top_p and temperature have no effect.
gen_kwargs = {"max_new_tokens": 1024, "top_p": 0.8, "temperature": 0.0, "do_sample": False, "repetition_penalty": 1.0}
output = model.generate(inputs['input_ids'], **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
```
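
The system prompt in the prompt example asks the model to answer in JSON only, so the decoded text can usually be post-processed into a dictionary. The following is a minimal sketch rather than part of the original pipeline: `extract_json` is a hypothetical helper, and it assumes the decoded string holds the prompt followed by one JSON object as the answer. It returns the last JSON object it can parse and skips brace-delimited fragments that are not valid JSON.

```python
import json


def extract_json(generated: str) -> dict:
    """Return the last parseable JSON object found in `generated` (hypothetical helper)."""
    decoder = json.JSONDecoder()
    result = None
    idx = 0
    while True:
        start = generated.find("{", idx)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at `start`
            # and reports the index where that value ends.
            obj, end = decoder.raw_decode(generated, start)
        except json.JSONDecodeError:
            idx = start + 1  # not valid JSON here; try the next brace
            continue
        if isinstance(obj, dict):
            result = obj
        idx = end
    if result is None:
        raise ValueError("no JSON object found in model output")
    return result
```

With the inference snippet above, this would be called as `extract_json(output)` on the decoded string. If generation stops at `max_new_tokens` before the object is closed, the helper raises `ValueError` instead of returning a partial result.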

## Prompt example

The pathology report used below is a fictional example.

```
<|im_start|> system
You are a pathologist who specialized in lung cancer.
Your task is extracting informations requested by the user from the lung cancer pathology report and formatting extracted informations into JSON.
The information to be extracted is clearly specified in the report, so one must avoid from inferring information that is not present.
Remember, you MUST answer in JSON only. Avoid any additional explanations. user
Extract the following informations (value-set) from the report I provide.
If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.<|im_end|>
<|im_start|> user
Extract the following informations (value-set) from the report I provide.
If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.
<value-set>
- MORPHOLOGY_DIAGNOSIS
- SUBTYPE_DOMINANT
- MAX_SIZE_OF_TUMOR(invasive component only)
- MAX_SIZE_OF_TUMOR(including CIS=AIS)
- INVASION_TO_VISCERAL_PLEURAL
- MAIN_BRONCHUS
- INVASION_TO_CHEST_WALL
- INVASION_TO_PARIETAL_PLEURA
- INVASION_TO_PERICARDIUM
- INVASION_TO_PHRENIC_NERVE
- TUMOR_SIZE_CNT
- LUNG_TO_LUNG_METASTASIS
- INTRAPULMONARY_METASTASIS
- SATELLITE_TUMOR_LOCATION
- SEPARATE_TUMOR_LOCATION
- INVASION_TO_MEDIASTINUM
- INVASION_TO_DIAPHRAGM
- INVASION_TO_HEART
- INVASION_TO_RECURRENT_LARYNGEAL_NERVE
- INVASION_TO_TRACHEA
- INVASION_TO_ESOPHAGUS
- INVASION_TO_SPINE
- METASTATIC_RIGHT_UPPER_LOBE
- METASTATIC_RIGHT_MIDDLE_LOBE
- METASTATIC_RIGHT_LOWER_LOBE
- METASTATIC_LEFT_UPPER_LOBE
- METASTATIC_LEFT_LOWER_LOBE
- INVASION_TO_AORTA
- INVASION_TO_SVC
- INVASION_TO_IVC
- INVASION_TO_PULMONARY_ARTERY
- INVASION_TO_PULMONARY_VEIN
- INVASION_TO_CARINA
- PRIMARY_CANCER_LOCATION_RIGHT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_MIDDLE_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_LOWER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_LOWER_LOBE
- RELATED_TO_ATELECTASIS_OR_OBSTRUCTIVE_PNEUMONITIS
- PRIMARY_SITE_LATERALITY
- LYMPH_METASTASIS_SITES
- NUMER_OF_LYMPH_NODE_META_CASES
---
<report>
[A] Lung, left lower lobe, lobectomy
1. ADENOSQUAMOUS CARCINOMA [by 2015 WHO classification]
- other subtype: acinar (50%), lepidic (30%), solid (20%)
1) Pre-operative / Previous treatment: not done
2) Histologic grade: moderately differentiated
3) Size of tumor:
a. Invasive component only: 3.5 x 2.5 x 1.3 cm, 2.4 x 2.3 x 1.1 cm
b. Including CIS component: 3.9 x 2.6 x 1.3 cm, 3.8 x 3.1 x 1.2 cm
4) Extent of invasion
a. Invasion to visceral pleura: PRESENT (P2)
b. Invasion to superior vena cava: present
5) Main bronchus: not submitted
6) Necrosis: absent
7) Resection margin: free from carcinoma (safey margin: 1.1 cm)
8) Lymph node: metastasis in 2 out of 10 regional lymph nodes
(peribronchial lymph node: 1/3, LN#5,6 :0/1, LN#7:0/3, LN#12: 1/2)
<|im_end|>
<|im_start|> pathologist
```
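
The layout above can be assembled programmatically. The following is a minimal sketch that assumes the role names and whitespace exactly as they appear in the example. `build_prompt` and its parameter names are hypothetical, and the training repository remains the authoritative reference for the exact format.

```python
from typing import Sequence


def build_prompt(system: str, instruction: str, value_set: Sequence[str], report: str) -> str:
    """Assemble a ChatML-style prompt in the layout of the example (hypothetical helper)."""
    # One "- NAME" line per requested field, as in the <value-set> section.
    items = "\n".join(f"- {name}" for name in value_set)
    return (
        f"<|im_start|> system\n{system}<|im_end|>\n"
        f"<|im_start|> user\n{instruction}\n"
        f"<value-set>\n{items}\n---\n"
        f"<report>\n{report}\n<|im_end|>\n"
        # The prompt ends with the open "pathologist" turn so the model writes the answer.
        f"<|im_start|> pathologist\n"
    )


# Example call with placeholder strings; substitute the real system prompt,
# instruction text, field names, and report.
prompt = build_prompt(
    system="You are a pathologist who specialized in lung cancer.",
    instruction="Extract the following informations (value-set) from the report I provide.",
    value_set=["MORPHOLOGY_DIAGNOSIS", "MAIN_BRONCHUS"],
    report="[A] Lung, left lower lobe, lobectomy",
)
```

The resulting string can be passed to the tokenizer in place of the `"Machine learning is"` placeholder used in the inference snippet.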

## Citation
```
@article{cho2024ie,
  title={Extracting lung cancer staging descriptors from pathology reports: a generative language model approach},
  author={Hyeongmin Cho and Sooyoung Yoo and Borham Kim and Sowon Jang and Leonard Sunwoo and Sanghwan Kim and Donghyoung Lee and Seok Kim and Sejin Nam and Jin-Haeng Chung},
  journal={},
  volume={},
  pages={},
  year={},
  publisher={},
  issn={},
  doi={},
  url={}
}
```