Lowenzahn
/

PathoIE-Orca-2-7B

Lowenzahn committed on Aug 2

Commit

656fc01

•

1 Parent(s): 8642a11

Update README.md

Files changed (1)

README.md +159 -3

README.md CHANGED

@@ -1,3 +1,159 @@
----
-license: apache-2.0
----

+---
+library_name: peft
+base_model: microsoft/Orca-2-7b
+license: apache-2.0
+language:
+- en
+---
+# PathoIE-Orca-2-7B
+<img src="https://cdn-uploads.huggingface.co/production/uploads/646704281dd5854d4de2cdda/eaHoW0BRCBNmcJs83Y5PI.webp" width="500" />
+## Training:
+The training code is available on GitHub: https://github.com/HIRC-SNUBH/Curation_LLM_PathoReport.git
+- PEFT 0.4.0
+## Inference
+The model was trained on instructions that follow the ChatML template, so the tokenizer must be extended with the ChatML special tokens before inference.
+``` python
+import torch
+from transformers import AddedToken, AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel
+# Load base model
+base_model = AutoModelForCausalLM.from_pretrained(
+    'microsoft/Orca-2-7b',
+    trust_remote_code=True,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,   # Optional: lower the precision if VRAM is insufficient.
+)
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained('microsoft/Orca-2-7b')
+tokenizer.add_special_tokens(dict(
+    eos_token=AddedToken("<|im_end|>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
+    unk_token=AddedToken("<unk>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
+    bos_token=AddedToken("<s>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
+    pad_token=AddedToken("</s>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True),
+))
+tokenizer.add_tokens([AddedToken("<|im_start|>", single_word=False, lstrip=True, rstrip=True, normalized=False)], special_tokens=True)
+tokenizer.additional_special_tokens = ['<unk>', '<s>', '</s>', '<|im_end|>', '<|im_start|>']
+# Resize the base model's embeddings to match the extended tokenizer
+base_model.resize_token_embeddings(len(tokenizer))
+base_model.config.eos_token_id = tokenizer.eos_token_id
+# Load PEFT
+model = PeftModel.from_pretrained(base_model, 'Lowenzahn/PathoIE-Orca-2-7B')
+model = model.merge_and_unload()
+model = model.eval()
+# Inference
+prompts = ["Machine learning is"]
+inputs = tokenizer(prompts, return_tensors="pt").to(model.device)  # keep inputs on the model's device
+gen_kwargs = {"max_new_tokens": 1024, "top_p": 0.8, "temperature": 0.0, "do_sample": False, "repetition_penalty": 1.0}
+output = model.generate(inputs['input_ids'], **gen_kwargs)
+output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
+print(output)
+```
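The system prompt in this README asks the model to answer in JSON only, while the decoded string from the snippet above also contains the prompt text. A minimal sketch of pulling the JSON object out of such a string follows. The helper name `extract_json` and the sample keys are hypothetical and not part of the released code; it assumes the first parseable `{...}` object in the text is the model's answer, which may not hold if the report itself contains braces.

```python
import json


def extract_json(text: str):
    """Return the first JSON object that parses in `text`, or None.

    Hypothetical helper: scans for '{' and tries to decode a JSON
    object starting at each candidate position.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from here; try the next opening brace.
            start = text.find("{", start + 1)
    return None


if __name__ == "__main__":
    # Illustrative string only; real model output may differ.
    decoded = 'pathologist\n{"MAIN_BRONCHUS": "not submitted"}'
    print(extract_json(decoded))
```

To avoid matching braces inside the prompt, one option is to decode only the newly generated tokens, for example `output[0][inputs['input_ids'].shape[1]:]`, before calling the helper.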
+## Prompt example
+The pathology report used below is a fictitious example.
+```
+<|im_start|> system
+You are a pathologist who specialized in lung cancer.
+Your task is extracting informations requested by the user from the lung cancer pathology report and formatting extracted informations into JSON.
+The information to be extracted is clearly specified in the report, so one must avoid from inferring information that is not present.
+Remember, you MUST answer in JSON only. Avoid any additional explanations.<|im_end|>
+<|im_start|> user
+Extract the following informations (value-set) from the report I provide.
+If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.
+<value-set>
+- MORPHOLOGY_DIAGNOSIS
+- SUBTYPE_DOMINANT
+- MAX_SIZE_OF_TUMOR(invasive component only)
+- MAX_SIZE_OF_TUMOR(including CIS=AIS)
+- INVASION_TO_VISCERAL_PLEURAL
+- MAIN_BRONCHUS
+- INVASION_TO_CHEST_WALL
+- INVASION_TO_PARIETAL_PLEURA
+- INVASION_TO_PERICARDIUM
+- INVASION_TO_PHRENIC_NERVE
+- TUMOR_SIZE_CNT
+- LUNG_TO_LUNG_METASTASIS
+- INTRAPULMONARY_METASTASIS
+- SATELLITE_TUMOR_LOCATION
+- SEPARATE_TUMOR_LOCATION
+- INVASION_TO_MEDIASTINUM
+- INVASION_TO_DIAPHRAGM
+- INVASION_TO_HEART
+- INVASION_TO_RECURRENT_LARYNGEAL_NERVE
+- INVASION_TO_TRACHEA
+- INVASION_TO_ESOPHAGUS
+- INVASION_TO_SPINE
+- METASTATIC_RIGHT_UPPER_LOBE
+- METASTATIC_RIGHT_MIDDLE_LOBE
+- METASTATIC_RIGHT_LOWER_LOBE
+- METASTATIC_LEFT_UPPER_LOBE
+- METASTATIC_LEFT_LOWER_LOBE
+- INVASION_TO_AORTA
+- INVASION_TO_SVC
+- INVASION_TO_IVC
+- INVASION_TO_PULMONARY_ARTERY
+- INVASION_TO_PULMONARY_VEIN
+- INVASION_TO_CARINA
+- PRIMARY_CANCER_LOCATION_RIGHT_UPPER_LOBE
+- PRIMARY_CANCER_LOCATION_RIGHT_MIDDLE_LOBE
+- PRIMARY_CANCER_LOCATION_RIGHT_LOWER_LOBE
+- PRIMARY_CANCER_LOCATION_LEFT_UPPER_LOBE
+- PRIMARY_CANCER_LOCATION_LEFT_LOWER_LOBE
+- RELATED_TO_ATELECTASIS_OR_OBSTRUCTIVE_PNEUMONITIS
+- PRIMARY_SITE_LATERALITY
+- LYMPH_METASTASIS_SITES
+- NUMER_OF_LYMPH_NODE_META_CASES
+---
+<report>
+[A] Lung, left lower lobe, lobectomy
+1. ADENOSQUAMOUS CARCINOMA [by 2015 WHO classification]
+- other subtype: acinar (50%), lepidic (30%), solid (20%)
+    1) Pre-operative / Previous treatment: not done
+    2) Histologic grade: moderately differentiated
+    3) Size of tumor:
+        a. Invasive component only: 3.5 x 2.5 x 1.3 cm, 2.4 x 2.3 x 1.1 cm
+        b. Including CIS component: 3.9 x 2.6 x 1.3 cm, 3.8 x 3.1 x 1.2 cm
+    4) Extent of invasion
+        a. Invasion to visceral pleura: PRESENT (P2)
+        b. Invasion to superior vena cava: present
+    5) Main bronchus: not submitted
+    6) Necrosis: absent
+    7) Resection margin: free from carcinoma (safety margin: 1.1 cm)
+    8) Lymph node: metastasis in 2 out of 10 regional lymph nodes
+        (peribronchial lymph node: 1/3, LN#5,6 :0/1, LN#7:0/3, LN#12: 1/2)
+<|im_end|>
+<|im_start|> pathologist
+```
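The layout shown above (a `system` turn, a `user` turn holding the value-set and the report, then an open `pathologist` turn) can be assembled programmatically. The following is a sketch under assumptions: `build_prompt` is a hypothetical helper, not part of the released code, and the exact whitespace, including the trailing newline after `pathologist`, is a guess that should be checked against the training scripts in the GitHub repository.

```python
from typing import Sequence


def build_prompt(system: str, instruction: str,
                 fields: Sequence[str], report: str) -> str:
    """Assemble a prompt in the layout of the example above.

    Hypothetical helper; whitespace details are assumptions.
    """
    # One "- NAME" line per requested field, as in the <value-set> block.
    value_set = "\n".join(f"- {name}" for name in fields)
    user = (
        f"{instruction.strip()}\n"
        f"<value-set>\n{value_set}\n"
        f"---\n"
        f"<report>\n{report.strip()}\n"
    )
    return (
        f"<|im_start|> system\n{system.strip()}<|im_end|>\n"
        f"<|im_start|> user\n{user}<|im_end|>\n"
        # Assumption: the open turn ends with a newline.
        f"<|im_start|> pathologist\n"
    )


if __name__ == "__main__":
    prompt = build_prompt(
        system="You are a pathologist who specialized in lung cancer.",
        instruction="Extract the following informations (value-set) from the report I provide.",
        fields=["MORPHOLOGY_DIAGNOSIS", "MAIN_BRONCHUS"],
        report="[A] Lung, left lower lobe, lobectomy",
    )
    print(prompt)
```

The resulting string can be passed in place of the placeholder entry in `prompts` in the inference snippet.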
+## Citation
+```
+@article{cho2024ie,
+    title={Extracting lung cancer staging descriptors from pathology reports: a generative language model approach},
+    author={Hyeongmin Cho and Sooyoung Yoo and Borham Kim and Sowon Jang and Leonard Sunwoo and Sanghwan Kim and Donghyoung Lee and Seok Kim and Sejin Nam and Jin-Haeng Chung},
+    journal={},
+    volume={},
+    pages={},
+    year={},
+    publisher={},
+    issn={},
+    doi={},
+    url={}
+}
+```