Update README.md

bd28ab5 verified 4 months ago

5.83 kB

	---
	library_name: peft
	base_model: mistralai/Mistral-7B-v0.1
	license: apache-2.0
	language:
	- en
	---
	# PathoIE-Mistral-7B

	<img src="https://cdn-uploads.huggingface.co/production/uploads/646704281dd5854d4de2cdda/Rj_swAuOXmAUNWztt6fpp.webp" width="500" />


	## Training:

	Check out our githbub: https://github.com/HIRC-SNUBH/Curation_LLM_PathoReport.git

	- PEFT 0.4.0

	## Inference

	Since the model was trained using instructions following the ChatML template, modifications to the tokenizer are required.

	``` python
	from datasets import load_dataset
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from peft import PeftModel

	# Load base model
	base_model = AutoModelForCausalLM.from_pretrained(
	'mistralai/Mistral-7B-v0.1',
	trust_remote_code=True,
	device_map="auto",
	torch_dtype=torch.bfloat16, # Optional, if you have insufficient VRAM, lower the precision.
	)

	# Load tokenizer
	tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')
	tokenizer.add_special_tokens(dict(
	eos_token=AddedToken("<\|im_end\|>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
	unk_token=AddedToken("<unk>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
	bos_token=AddedToken("<s>", single_word=False, lstrip=False, rstrip=False, normalized=True, special=True),
	pad_token=AddedToken("</s>", single_word=False, lstrip=False, rstrip=False, normalized=False, special=True),
	))
	tokenizer.add_tokens([AddedToken("<\|im_start\|>", single_word=False, lstrip=True, rstrip=True, normalized=False)], special_tokens=True)
	tokenizer.additional_special_tokens = ['<unk>', '<s>', '</s>', '<\|im_end\|>', '<\|im_start\|>']

	model.resize_token_embeddings(len(tokenizer))
	model.config.eos_token_id = tokenizer.eos_token_id

	# Load PEFT
	model = PeftModel.from_pretrained(base_model, 'Lowenzahn/PathoIE-Mistral-7B')
	model = model.merge_and_unload()
	model = model.eval()

	# Inference
	prompts = ["Machine learning is"]
	inputs = tokenizer(prompts, return_tensors="pt")
	gen_kwargs = {"max_new_tokens": 1024, "top_p": 0.8, "temperature": 0.0, "do_sample": False, "repetition_penalty": 1.0}
	output = model.generate(inputs['input_ids'], **gen_kwargs)
	output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
	print(output)
	```


	# Prompt example

	The pathology report used below is a fictive example.

	```
	<\|im_start\|> system
	You are a pathologist who specialized in lung cancer.
	Your task is extracting informations requested by the user from the lung cancer pathology report and formatting extracted informations into JSON.
	The information to be extracted is clearly specified in the report, so one must avoid from inferring information that is not present.
	Remember, you MUST answer in JSON only. Avoid any additional explanations. user
	Extract the following informations (value-set) from the report I provide.
	If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.<\|im_end\|>
	<\|im_start\|> user
	Extract the following informations (value-set) from the report I provide.
	If the required information to extract each value in the value-set is not present in the pathology report, consider it as 'not submitted'.
	<value-set>
	- MORPHOLOGY_DIAGNOSIS
	- SUBTYPE_DOMINANT
	- MAX_SIZE_OF_TUMOR(invasive component only)
	- MAX_SIZE_OF_TUMOR(including CIS=AIS)
	- INVASION_TO_VISCERAL_PLEURAL
	- MAIN_BRONCHUS
	- INVASION_TO_CHEST_WALL
	- INVASION_TO_PARIETAL_PLEURA
	- INVASION_TO_PERICARDIUM
	- INVASION_TO_PHRENIC_NERVE
	- TUMOR_SIZE_CNT
	- LUNG_TO_LUNG_METASTASIS
	- INTRAPULMONARY_METASTASIS
	- SATELLITE_TUMOR_LOCATION
	- SEPARATE_TUMOR_LOCATION
	- INVASION_TO_MEDIASTINUM
	- INVASION_TO_DIAPHRAGM
	- INVASION_TO_HEART
	- INVASION_TO_RECURRENT_LARYNGEAL_NERVE
	- INVASION_TO_TRACHEA
	- INVASION_TO_ESOPHAGUS
	- INVASION_TO_SPINE
	- METASTATIC_RIGHT_UPPER_LOBE
	- METASTATIC_RIGHT_MIDDLE_LOBE
	- METASTATIC_RIGHT_LOWER_LOBE
	- METASTATIC_LEFT_UPPER_LOBE
	- METASTATIC_LEFT_LOWER_LOBE
	- INVASION_TO_AORTA
	- INVASION_TO_SVC
	- INVASION_TO_IVC
	- INVASION_TO_PULMONARY_ARTERY
	- INVASION_TO_PULMONARY_VEIN
	- INVASION_TO_CARINA
	- PRIMARY_CANCER_LOCATION_RIGHT_UPPER_LOBE
	- PRIMARY_CANCER_LOCATION_RIGHT_MIDDLE_LOBE
	- PRIMARY_CANCER_LOCATION_RIGHT_LOWER_LOBE
	- PRIMARY_CANCER_LOCATION_LEFT_UPPER_LOBE
	- PRIMARY_CANCER_LOCATION_LEFT_LOWER_LOBE
	- RELATED_TO_ATELECTASIS_OR_OBSTRUCTIVE_PNEUMONITIS
	- PRIMARY_SITE_LATERALITY
	- LYMPH_METASTASIS_SITES
	- NUMER_OF_LYMPH_NODE_META_CASES
	---
	<report>
	[A] Lung, left lower lobe, lobectomy
	1. ADENOSQUAMOUS CARCINOMA [by 2015 WHO classification]
	- other subtype: acinar (50%), lepidic (30%), solid (20%)
	1) Pre-operative / Previous treatment: not done
	2) Histologic grade: moderately differentiated
	3) Size of tumor:
	a. Invasive component only: 3.5 x 2.5 x 1.3 cm, 2.4 x 2.3 x 1.1 cm
	b. Including CIS component: 3.9 x 2.6 x 1.3 cm, 3.8 x 3.1 x 1.2 cm
	4) Extent of invasion
	a. Invasion to visceral pleura: PRESENT (P2)
	b. Invasion to superior vena cava: present
	5) Main bronchus: not submitted
	6) Necrosis: absent
	7) Resection margin: free from carcinoma (safey margin: 1.1 cm)
	8) Lymph node: metastasis in 2 out of 10 regional lymph nodes
	(peribronchial lymph node: 1/3, LN#5,6 :0/1, LN#7:0/3, LN#12: 1/2)
	<\|im_end\|>
	<\|im_start\|> pathologist
	```

	## Citation
	```
	@article{cho2024ie,
	title={Extracting lung cancer staging descriptors from pathology reports: a generative language model approach},
	author={Hyeongmin Cho and Sooyoung Yoo and Borham Kim and Sowon Jang and Leonard Sunwoo and Sanghwan Kim and Donghyoung Lee and Seok Kim and Sejin Nam and Jin-Haeng Chung},
	journal={},
	volume={},
	pages={},
	year={},
	publisher={},
	issn={},
	doi={},
	url={}
	}
	```