# Large Language Model Locally Fine-tuning (LLMLF) on Chinese Medical Imaging Reports

## Data Specification

| Medical Image Report Type           | Train + Validation Set Size (9:1 split) | Test Set Size | Best Training Epochs |
|-------------------------------------|:---------------------------------------:|:-------------:|:--------------------:|
| Endocrine Adrenal CT                |                  1240                   |      110      |         100          |
| Hepatobiliary Upper Abdomen CT      |                   833                   |      88       |          16          |
| Pancreatic Preoperative Staging CTA |                  1228                   |      115      |          50          |

## Repository Specification

- **model-endocrine-100**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 100 epochs on the Endocrine Adrenal CT dataset.
- **model-hepatobiliary-16**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 16 epochs on the Hepatobiliary Upper Abdomen CT dataset.
- **model-pancreatic-50**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 50 epochs on the Pancreatic Preoperative Staging CTA dataset.
- **samples**: Sample files in CSV format. Each model has one sample input file and one sample output file. The input file has an `INSTRUCTION` column and a `RESPONSE` column, which contain 2 medical image reports randomly selected from the test dataset and the corresponding physicians' clinical advice, respectively. The output file is identical to the input file except for an additional `GENERATED` column, which contains the clinical advice generated by the model. Users may refer to these sample files when running inference with `chatbot.py`; details are given under Getting Started.
- **chatbot.py**: A CLI (command-line interface) for inference, modified from [chatbot.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/inference/chatbot.py). Note that unlike the original [chatbot.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/), we disabled context preservation, assuming that each input report is complete and independent of any previous input. In effect, a `clear` command is executed automatically after every single inference. We also replaced the default prefix "Human: " and suffix "Assistant: " with "根据下面一段影像描述:" ("According to the following medical image report: ") and "生成一份对应的诊断意见:" ("Generate a piece of clinical advice: "), to better accommodate Chinese usage.

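The prompt wrapping and the stateless behavior described for `chatbot.py` can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `chatbot.py`: `build_prompt` is a hypothetical helper, and the exact whitespace between the report and the suffix is assumed.

```python
# Prefix and suffix quoted from the description of chatbot.py above.
PREFIX = "根据下面一段影像描述:"
SUFFIX = "生成一份对应的诊断意见:"


def build_prompt(report: str) -> str:
    """Wrap a single report in the prefix and suffix.

    Hypothetical helper: the newline between report and suffix is an
    assumption. No earlier reports are carried over, which mirrors the
    disabled context preservation.
    """
    return f"{PREFIX}{report.strip()}\n{SUFFIX}"


# Each prompt is built from scratch, so the second one contains
# nothing from the first.
first = build_prompt("report A")
second = build_prompt("report B")
print(second)
```

Because every prompt is built from one report only, the order in which reports are entered should not influence the prompt the model sees.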
## Prerequisites (Ours)

- python 3.9.13
- pandas 2.0.1
- transformers 4.28.1

## Getting Started

### Installation

```bash
pip install pandas==2.0.1 transformers==4.28.1

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/junwen-liu/IMIT-MedImg-LLMLF
cd IMIT-MedImg-LLMLF
```

### Basic Usage

```bash
python chatbot.py \
    --path [directory containing trained model] \
    --max_new_tokens [maximum new tokens to generate per response] \
    --in_csv [path to the input csv file] \
    --out_csv [path to the output csv file]
```

The interactive example below omits `--in_csv` and `--out_csv`; the batch example passes both.

### Example: Interactive Inferencing on Endocrine Adrenal CT Test Dataset

`chatbot.py` supports interactive inference: users manually feed medical image reports to it one at a time and receive clinical advice in a conversational style.

- **Step 1**: Get the two Endocrine Adrenal CT reports from the `INSTRUCTION` column of `samples/sample-input-endocrine.csv`, as listed below:

  ```bash
  # report 1
  左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。

  # report 2
  左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
  ```

- **Step 2**: Launch the CLI:

  ```bash
  python chatbot.py \
      --path model-endocrine-100/ \
      --max_new_tokens 512
  ```

- **Step 3**: Enter the first report and wait for the generated clinical advice.
- **Step 4**: Enter the second report and wait for the generated clinical advice.
- **Step 5**: Enter `quit` to end the session.

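For Step 1, the reports can also be pulled out of the sample file programmatically rather than copied by hand. The following is a minimal sketch using Python's standard `csv` module. `load_reports` is a hypothetical helper that is not part of this repository, and the in-memory CSV is placeholder data standing in for `samples/sample-input-endocrine.csv`.

```python
import csv
import io


def load_reports(csv_file) -> list[str]:
    """Return the INSTRUCTION column of an open CSV file as a list.

    Hypothetical helper; it only assumes the column name INSTRUCTION,
    which is the one used by the sample input files.
    """
    return [row["INSTRUCTION"] for row in csv.DictReader(csv_file)]


# Placeholder data with the same columns as the sample input file.
demo_file = io.StringIO(
    "INSTRUCTION,RESPONSE\n"
    "report 1,physician advice 1\n"
    "report 2,physician advice 2\n"
)

reports = load_reports(demo_file)
for number, report in enumerate(reports, start=1):
    print(f"# report {number}")
    print(report)
```

To read the real sample file, the `io.StringIO(...)` stand-in could be replaced with `open("samples/sample-input-endocrine.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.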
The overall command-line session for Steps 2 to 5 should look something like this:

```bash
$ python chatbot.py \
> --path model-endocrine-100/ \
> --max_new_tokens 512
Enter input (type 'quit' to exit): 左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
----------------------------------------
根据下面一段影像描述:左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
 生成一份对应的诊断意见:
左侧肾上腺体部结节;右肾低密度灶;拟复杂囊肿。附见脂肪肝。请结合临床及其他相关检查;随诊。
Enter input (type 'quit' to exit): 左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
----------------------------------------
根据下面一段影像描述:左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
 生成一份对应的诊断意见:
左侧肾上腺稍增粗。附见肝脏多发血管瘤;肝右叶小囊肿。请结合临床病史及其他相关检查;随访。
Enter input (type 'quit' to exit): quit
$
```

Users may compare the generated clinical advice (the line after " 生成一份对应的诊断意见:") with the sample data in the `GENERATED` column of `samples/sample-output-endocrine.csv`.

### Example: Batch Inferencing on Endocrine Adrenal CT Test Dataset

`chatbot.py` also supports batch inference: users supply a CSV file containing all medical image reports and receive all clinical advice at once in a generated CSV file.

- **Step 1**: Prepare a CSV file with an `INSTRUCTION` column that contains all medical image reports. Here we use `samples/sample-input-endocrine.csv` as input, which has the following structure:

  ```
  INSTRUCTION,RESPONSE
  [report 1],[physician advice 1]
  [report 2],[physician advice 2]
  ...
  [report n],[physician advice n]
  ```

- **Step 2**: Launch the CLI:

  ```bash
  python chatbot.py \
      --path model-endocrine-100/ \
      --max_new_tokens 512 \
      --in_csv samples/sample-input-endocrine.csv \
      --out_csv samples/generated-output-endocrine.csv
  ```

The overall command-line session should look something like this:

```bash
$ python chatbot.py \
> --path model-endocrine-100/ \
> --max_new_tokens 512 \
> --in_csv samples/sample-input-endocrine.csv \
> --out_csv samples/generated-output-endocrine.csv
-------------------- Instruction 1 --------------------
根据下面一段影像描述:左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
 生成一份对应的诊断意见:
左侧肾上腺体部结节;右肾低密度灶;拟复杂囊肿。附见脂肪肝。请结合临床及其他相关检查;随诊。
-------------------- Instruction 2 --------------------
根据下面一段影像描述:左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
 生成一份对应的诊断意见:
左侧肾上腺稍增粗。附见肝脏多发血管瘤;肝右叶小囊肿。请结合临床病史及其他相关检查;随访。
$
```

In addition, a CSV file with the following structure will be generated at `samples/generated-output-endocrine.csv`:

```
INSTRUCTION,RESPONSE,GENERATED
[report 1],[physician advice 1],[generated advice 1]
[report 2],[physician advice 2],[generated advice 2]
...
[report n],[physician advice n],[generated advice n]
```

Users may compare `samples/generated-output-endocrine.csv` with the sample file `samples/sample-output-endocrine.csv`.
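The comparison between the generated file and the sample file can be scripted. The sketch below uses Python's standard `csv` module and reports which rows differ in the `GENERATED` column. `read_column` and `differing_rows` are hypothetical helpers that are not part of this repository, and the two in-memory CSVs are placeholder data standing in for `samples/generated-output-endocrine.csv` and `samples/sample-output-endocrine.csv`.

```python
import csv
import io


def read_column(csv_file, column: str) -> list[str]:
    """Return one named column of an open CSV file as a list."""
    return [row[column] for row in csv.DictReader(csv_file)]


def differing_rows(generated: list[str], reference: list[str]) -> list[int]:
    """Return 1-based row numbers where the two columns differ.

    Leading and trailing whitespace is ignored (an assumption about
    what counts as "the same" advice).
    """
    if len(generated) != len(reference):
        raise ValueError("the two files have different numbers of rows")
    return [
        number
        for number, (ours, theirs) in enumerate(zip(generated, reference), start=1)
        if ours.strip() != theirs.strip()
    ]


# Placeholder data with the same columns as the real output files.
generated_file = io.StringIO(
    "INSTRUCTION,RESPONSE,GENERATED\n"
    "report 1,physician advice 1,generated advice 1\n"
    "report 2,physician advice 2,generated advice 2\n"
)
sample_file = io.StringIO(
    "INSTRUCTION,RESPONSE,GENERATED\n"
    "report 1,physician advice 1,generated advice 1\n"
    "report 2,physician advice 2,some other advice\n"
)

mismatches = differing_rows(
    read_column(generated_file, "GENERATED"),
    read_column(sample_file, "GENERATED"),
)
print(mismatches)  # [2]
```

An empty list means every generated row matches the sample file. The same helpers could compare `GENERATED` against `RESPONSE` to see where the model departs from the physicians' advice.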