# Large Language Model Locally Fine-tuning (LLMLF) on Chinese Medical Imaging Reports
## Data Specification
| Medical Image Report Type | Training & Validation Set Size (9:1 Split) | Test Set Size | Best Number of Training Epochs |
|-------------------------------------|:-----------------------------------:|:-----------------:|:------------:|
| Endocrine Adrenal CT | 1240 | 110 | 100 |
| Hepatobiliary Upper Abdomen CT | 833 | 88 | 16 |
| Pancreatic Preoperative Staging CTA | 1228 | 115 | 50 |
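
The table lists only the combined training and validation size for each report type. As a rough illustration of what a 9:1 split implies, the sketch below derives per-split counts with integer arithmetic. The rounding rule (training share rounded down, remainder assigned to validation) and the helper name `split_sizes` are assumptions; the exact split procedure used for these models is not stated in this README.

```python
def split_sizes(total: int, train_parts: int = 9, val_parts: int = 1) -> tuple[int, int]:
    """Split `total` into (train, validation) counts at train_parts:val_parts.

    Assumption: the training share is rounded down and the remainder
    goes to validation.
    """
    train = total * train_parts // (train_parts + val_parts)
    return train, total - train


# Combined training + validation sizes copied from the table above.
combined_sizes = {
    "Endocrine Adrenal CT": 1240,
    "Hepatobiliary Upper Abdomen CT": 833,
    "Pancreatic Preoperative Staging CTA": 1228,
}

for name, total in combined_sizes.items():
    train, val = split_sizes(total)
    print(f"{name}: train={train}, validation={val}")
```

A different rounding rule would shift the counts by at most one report per split.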
## Repository Specification
- **model-endocrine-100**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 100 epochs on the Endocrine Adrenal CT dataset.
- **model-hepatobiliary-16**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 16 epochs on the Hepatobiliary Upper Abdomen CT dataset.
- **model-pancreatic-50**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 50 epochs on the Pancreatic Preoperative Staging CTA dataset.
- **samples**: Sample files in CSV format. Each model has one sample input file and one sample output file. The input file has an `INSTRUCTION` column and a `RESPONSE` column, which contain two medical image reports randomly selected from the test dataset and the corresponding physicians' clinical advice, respectively. The output file is identical to the input file except for an additional `GENERATED` column, which contains the clinical advice generated by the model. Users may refer to these sample files when running inference with `chatbot.py`; details are given under Getting Started.
- **chatbot.py**: A CLI (command-line interface) for inference, modified from [chatbot.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/inference/chatbot.py). Unlike the original [chatbot.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/), this version disables context preservation, on the assumption that each input report is complete and independent of any previous input. In effect, a `clear` command is executed automatically after every inference. The default prefix "Human: " and suffix "Assistant: " are also replaced with "根据下面一段影像描述:" ("According to the following medical image report: ") and "生成一份对应的诊断意见:" ("Generate a piece of clinical advice: "), to better accommodate Chinese usage.
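
To make the prompt template and the disabled context preservation concrete, here is a minimal sketch. It is not the code of `chatbot.py`: the function names `build_prompt` and `answer_all`, the injected `generate` callable, and the newline placement between report and suffix are all assumptions made for illustration. Only the prefix and suffix strings come from the description above.

```python
from typing import Callable

# Prefix and suffix quoted in the chatbot.py description above.
PREFIX = "根据下面一段影像描述:"
SUFFIX = "生成一份对应的诊断意见:"


def build_prompt(report: str) -> str:
    """Wrap a single report in the prefix/suffix template.

    Assumption: a newline separates the report from the suffix, as the
    example transcripts suggest; the real script may join them differently.
    """
    return f"{PREFIX}{report.strip()}\n{SUFFIX}\n"


def answer_all(reports: list[str], generate: Callable[[str], str]) -> list[str]:
    """Answer each report independently.

    No history is accumulated: every prompt is built from the current
    report only, which mirrors an automatic `clear` after each inference.
    """
    return [generate(build_prompt(report)) for report in reports]


# Stand-in for the model call, used only to show that prompts are independent.
def fake_generate(prompt: str) -> str:
    return f"<advice for a prompt of {len(prompt)} characters>"


print(answer_all(["report one", "report two"], fake_generate))
```

Because each prompt contains exactly one report, the order in which reports are entered should not influence the prompt that the model sees.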
## Prerequisites (Our Environment)
- Python 3.9.13
- pandas 2.0.1
- transformers 4.28.1
## Getting Started
### Installation
```bash
pip install pandas==2.0.1 transformers==4.28.1
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/junwen-liu/IMIT-MedImg-LLMLF
cd IMIT-MedImg-LLMLF
```
### Basic Usage
```bash
python chatbot.py \
--path [directory containing trained model] \
--max_new_tokens [maximum new tokens to generate per response] \
--in_csv [path to the input csv file] \
--out_csv [path to the output csv file]
```
### Example: Interactive Inferencing on Endocrine Adrenal CT Test Dataset
`chatbot.py` supports interactive inference: users manually enter medical image reports one at a time and receive clinical advice in a conversational style.
- **Step 1**: Get the two Endocrine Adrenal CT reports from the `INSTRUCTION` column of `samples/sample-input-endocrine.csv`, as listed below:
```bash
# report 1
左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
# report 2
左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
```
- **Step 2**: Launch CLI:
```bash
python chatbot.py \
--path model-endocrine-100/ \
--max_new_tokens 512
```
- **Step 3**: Enter the first report and wait for the generated clinical advice.
- **Step 4**: Enter the second report and wait for the generated clinical advice.
- **Step 5**: Enter `quit` to quit inferencing.
The full terminal session for Steps 2 to 5 should look something like this:
```bash
$ python chatbot.py \
> --path model-endocrine-100/ \
> --max_new_tokens 512
Enter input (type 'quit' to exit): 左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
----------------------------------------
根据下面一段影像描述:左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
生成一份对应的诊断意见:
左侧肾上腺体部结节;右肾低密度灶;拟复杂囊肿。附见脂肪肝。请结合临床及其他相关检查;随诊。
Enter input (type 'quit' to exit): 左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
----------------------------------------
根据下面一段影像描述:左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
生成一份对应的诊断意见:
左侧肾上腺稍增粗。附见肝脏多发血管瘤;肝右叶小囊肿。请结合临床病史及其他相关检查;随访。
Enter input (type 'quit' to exit): quit
$
```
Users may compare the generated clinical advice (the line after " 生成一份对应的诊断意见:") with the sample data in the `GENERATED` column of `samples/sample-output-endocrine.csv`.
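
If the terminal output is saved to a text file, the generated advice can be pulled out programmatically rather than by eye. The sketch below assumes the printed layout shown in the session above (the advice is the first non-empty line after the suffix line); `extract_advice` is a hypothetical helper, not part of `chatbot.py`.

```python
SUFFIX = "生成一份对应的诊断意见:"


def extract_advice(block: str) -> str:
    """Return the first non-empty line that follows the suffix in one printed block."""
    _, found, tail = block.partition(SUFFIX)
    if not found:
        raise ValueError("suffix line not found in block")
    for line in tail.splitlines():
        if line.strip():
            return line.strip()
    return ""


# A shortened block in the same layout as the session above.
block = (
    "根据下面一段影像描述:左侧肾上腺稍增粗。\n"
    "生成一份对应的诊断意见:\n"
    "左侧肾上腺稍增粗。请结合临床病史及其他相关检查;随访。\n"
)
print(extract_advice(block))
```

The returned string can then be compared against the matching `GENERATED` cell of the sample output file.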
### Example: Batch Inferencing on Endocrine Adrenal CT Test Dataset
`chatbot.py` also supports batch inference: users supply a CSV file containing all medical image reports and receive all clinical advice at once in a generated CSV file.
- **Step 1**: Prepare a csv file with an `INSTRUCTION` column that contains all medical image reports. Here we use `samples/sample-input-endocrine.csv` as input, which has the following structure:
```
INSTRUCTION,RESPONSE
[report 1],[physician advice 1]
[report 2],[physician advice 2]
...
[report n],[physician advice n]
```
- **Step 2**: Launch CLI:
```bash
python chatbot.py \
--path model-endocrine-100/ \
--max_new_tokens 512 \
--in_csv samples/sample-input-endocrine.csv \
--out_csv samples/generated-output-endocrine.csv
```
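
For Step 1, an input file with the layout shown above can be produced with pandas. This is a minimal sketch: `make_input_frame` is a hypothetical helper, the placeholder report text is invented, and treating the `RESPONSE` column as optional is an assumption based on Step 1 mentioning only the `INSTRUCTION` column.

```python
import io

import pandas as pd


def make_input_frame(reports, advice=None) -> pd.DataFrame:
    """Build a frame with an INSTRUCTION column and, if given, a RESPONSE column."""
    frame = pd.DataFrame({"INSTRUCTION": list(reports)})
    if advice is not None:
        advice = list(advice)
        if len(advice) != len(frame):
            raise ValueError("reports and advice must have the same length")
        frame["RESPONSE"] = advice
    return frame


frame = make_input_frame(
    ["report A, with a comma", "report B"],
    ["physician advice A", "physician advice B"],
)

# Written to an in-memory buffer here; pass a file path to write to disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Letting pandas write the file means that fields containing commas or quotes are quoted automatically, which matters for free-text reports.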
Running the command in Step 2 should produce a terminal session something like this:
```bash
$ python chatbot.py \
> --path model-endocrine-100/ \
> --max_new_tokens 512 \
> --in_csv samples/sample-input-endocrine.csv \
> --out_csv samples/generated-output-endocrine.csv
-------------------- Instruction 1 --------------------
根据下面一段影像描述:左侧肾上腺体部见大小约为17.8mm×15.7mm结节;密度不均匀;其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
生成一份对应的诊断意见:
左侧肾上腺体部结节;右肾低密度灶;拟复杂囊肿。附见脂肪肝。请结合临床及其他相关检查;随诊。
-------------------- Instruction 2 --------------------
根据下面一段影像描述:左侧肾上腺稍增粗;右侧肾上腺大小、形态、密度未见明显异常;双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常;均匀强化;未见异常密度影;未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤;肝右叶小囊肿。
生成一份对应的诊断意见:
左侧肾上腺稍增粗。附见肝脏多发血管瘤;肝右叶小囊肿。请结合临床病史及其他相关检查;随访。
$
```
In addition, a CSV file with the following structure will be generated at `samples/generated-output-endocrine.csv`:
```
INSTRUCTION,RESPONSE,GENERATED
[report 1],[physician advice 1],[generated advice 1]
[report 2],[physician advice 2],[generated advice 2]
...
[report n],[physician advice n],[generated advice n]
```
Users may compare `samples/generated-output-endocrine.csv` with the sample file `samples/sample-output-endocrine.csv`.
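
One possible way to run that comparison is to join the two files on `INSTRUCTION` and flag rows whose `GENERATED` text is identical. The helper name `compare_generated` and the exact-match criterion are assumptions for this sketch; whether the outputs match exactly depends on the decoding settings, which this README does not specify.

```python
import pandas as pd


def compare_generated(ours: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Pair up GENERATED values by INSTRUCTION and flag exact matches."""
    merged = ours.merge(
        reference[["INSTRUCTION", "GENERATED"]],
        on="INSTRUCTION",
        how="inner",
        suffixes=("_ours", "_reference"),
    )
    merged["MATCH"] = (
        merged["GENERATED_ours"].str.strip()
        == merged["GENERATED_reference"].str.strip()
    )
    return merged[["INSTRUCTION", "GENERATED_ours", "GENERATED_reference", "MATCH"]]


# Usage with the files named above (uncomment after running batch inference):
# ours = pd.read_csv("samples/generated-output-endocrine.csv")
# reference = pd.read_csv("samples/sample-output-endocrine.csv")
# print(compare_generated(ours, reference))
```

Rows with `MATCH` set to `False` are not necessarily errors; they only indicate that the locally generated text differs from the sample.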