---
library_name: peft
base_model: mistralai/Mistral-7B-v0.1
license: apache-2.0
language:
- en
---
# PathoIE-Mistral-7B

<img src="https://cdn-uploads.huggingface.co/production/uploads/646704281dd5854d4de2cdda/Rj_swAuOXmAUNWztt6fpp.webp" width="500" />


## Training

Check out our GitHub repository: https://github.com/HIRC-SNUBH/Curation_LLM_PathoReport.git

- PEFT 0.4.0

## Inference

Since the model was trained on instructions following the ChatML template, modifications to the tokenizer are required.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from tokenizers import AddedToken
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    'mistralai/Mistral-7B-v0.1',
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Optional; if you have insufficient VRAM, lower the precision.
)

# Load tokenizer and register the ChatML special tokens
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')
tokenizer.add_special_tokens(dict(
    eos_token=AddedToken("<|im_end|>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    unk_token=AddedToken("<unk>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    bos_token=AddedToken("<s>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
    pad_token=AddedToken("</s>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True),
))
tokenizer.add_tokens([AddedToken("<|im_start|>", single_word=False, lstrip=True, rstrip=True, normalized=False)], special_tokens=True)
tokenizer.additional_special_tokens = ['<unk>', '<s>', '</s>', '<|im_end|>', '<|im_start|>']

base_model.resize_token_embeddings(len(tokenizer))
base_model.config.eos_token_id = tokenizer.eos_token_id

# Load the PEFT adapter and merge it into the base model
model = PeftModel.from_pretrained(base_model, 'Lowenzahn/PathoIE-Mistral-7B')
model = model.merge_and_unload()
model = model.eval()

# Inference
prompts = ["Machine learning is"]
inputs = tokenizer(prompts, return_tensors="pt").to(model.device)
gen_kwargs = {"max_new_tokens": 1024, "top_p": 0.8, "temperature": 0.0, "do_sample": False, "repetition_penalty": 1.0}
output = model.generate(inputs['input_ids'], **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
```
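The adapter expects its input in the ChatML layout used during training, with the special tokens registered above and a `pathologist` turn left open for the model to complete. A minimal sketch of assembling such a prompt (the `build_chatml_prompt` helper is illustrative, not part of the released code):

```python
# Illustrative helper: assemble a ChatML-style prompt matching the special
# tokens registered above. Not part of the released repository.
def build_chatml_prompt(system: str, user: str, assistant_role: str = "pathologist") -> str:
    return (
        f"<|im_start|> system\n{system}<|im_end|>\n"
        f"<|im_start|> user\n{user}<|im_end|>\n"
        f"<|im_start|> {assistant_role}\n"  # left open for the model to complete
    )

prompt = build_chatml_prompt(
    "You are a pathologist who specialized in lung cancer.",
    "Extract the following informations (value-set) from the report I provide.",
)
print(prompt)
```

The resulting string is what goes into `tokenizer(...)` in place of the plain `prompts` list above.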


## Prompt example

The pathology report used below is a fictitious example.

```
<|im_start|> system
You are a pathologist who specialized in lung cancer.
Your task is extracting informations requested by the user from the lung cancer pathology report and formatting extracted informations into JSON.
The information to be extracted is clearly specified in the report, so one must avoid from inferring information that is not present.
Remember, you MUST answer in JSON only. Avoid any additional explanations.<|im_end|>
<|im_start|> user
Extract the following informations (value-set) from the report I provide.
If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.
<value-set>
- MORPHOLOGY_DIAGNOSIS
- SUBTYPE_DOMINANT
- MAX_SIZE_OF_TUMOR(invasive component only)
- MAX_SIZE_OF_TUMOR(including CIS=AIS)
- INVASION_TO_VISCERAL_PLEURAL
- MAIN_BRONCHUS
- INVASION_TO_CHEST_WALL
- INVASION_TO_PARIETAL_PLEURA
- INVASION_TO_PERICARDIUM
- INVASION_TO_PHRENIC_NERVE
- TUMOR_SIZE_CNT
- LUNG_TO_LUNG_METASTASIS
- INTRAPULMONARY_METASTASIS
- SATELLITE_TUMOR_LOCATION
- SEPARATE_TUMOR_LOCATION
- INVASION_TO_MEDIASTINUM
- INVASION_TO_DIAPHRAGM
- INVASION_TO_HEART
- INVASION_TO_RECURRENT_LARYNGEAL_NERVE
- INVASION_TO_TRACHEA
- INVASION_TO_ESOPHAGUS
- INVASION_TO_SPINE
- METASTATIC_RIGHT_UPPER_LOBE
- METASTATIC_RIGHT_MIDDLE_LOBE
- METASTATIC_RIGHT_LOWER_LOBE
- METASTATIC_LEFT_UPPER_LOBE
- METASTATIC_LEFT_LOWER_LOBE
- INVASION_TO_AORTA
- INVASION_TO_SVC
- INVASION_TO_IVC
- INVASION_TO_PULMONARY_ARTERY
- INVASION_TO_PULMONARY_VEIN
- INVASION_TO_CARINA
- PRIMARY_CANCER_LOCATION_RIGHT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_MIDDLE_LOBE
- PRIMARY_CANCER_LOCATION_RIGHT_LOWER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_UPPER_LOBE
- PRIMARY_CANCER_LOCATION_LEFT_LOWER_LOBE
- RELATED_TO_ATELECTASIS_OR_OBSTRUCTIVE_PNEUMONITIS
- PRIMARY_SITE_LATERALITY
- LYMPH_METASTASIS_SITES
- NUMER_OF_LYMPH_NODE_META_CASES
---
<report>
[A] Lung, left lower lobe, lobectomy
1. ADENOSQUAMOUS CARCINOMA [by 2015 WHO classification]
 - other subtype: acinar (50%), lepidic (30%), solid (20%)
1) Pre-operative / Previous treatment: not done
2) Histologic grade: moderately differentiated
3) Size of tumor:
 a. Invasive component only: 3.5 x 2.5 x 1.3 cm, 2.4 x 2.3 x 1.1 cm
 b. Including CIS component: 3.9 x 2.6 x 1.3 cm, 3.8 x 3.1 x 1.2 cm
4) Extent of invasion
 a. Invasion to visceral pleura: PRESENT (P2)
 b. Invasion to superior vena cava: present
5) Main bronchus: not submitted
6) Necrosis: absent
7) Resection margin: free from carcinoma (safety margin: 1.1 cm)
8) Lymph node: metastasis in 2 out of 10 regional lymph nodes
 (peribronchial lymph node: 1/3, LN#5,6 :0/1, LN#7:0/3, LN#12: 1/2)
<|im_end|>
<|im_start|> pathologist
```
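Because the system prompt instructs the model to answer in JSON only, the completion after the `pathologist` turn can be parsed directly into a dictionary keyed by the value-set fields. A hedged post-processing sketch (the `parse_model_answer` helper and the example string are illustrative, not part of the released code):

```python
import json

def parse_model_answer(generated: str) -> dict:
    # The model is instructed to emit JSON only, but trim any stray text
    # around the outermost {...} block before parsing, to be safe.
    start = generated.find("{")
    end = generated.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(generated[start:end + 1])

# Hypothetical output fragment using field names from the value-set above.
example_output = '{"MAIN_BRONCHUS": "not submitted", "INVASION_TO_SVC": "present"}'
fields = parse_model_answer(example_output)
print(fields["MAIN_BRONCHUS"])  # not submitted
```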

## Citation
```
@article{cho2024ie,
  title={Extracting lung cancer staging descriptors from pathology reports: a generative language model approach},
  author={Hyeongmin Cho and Sooyoung Yoo and Borham Kim and Sowon Jang and Leonard Sunwoo and Sanghwan Kim and Donghyoung Lee and Seok Kim and Sejin Nam and Jin-Haeng Chung},
  journal={},
  volume={},
  pages={},
  year={},
  publisher={},
  issn={},
  doi={},
  url={}
}
```