# Large Language Model Local Fine-tuning (LLMLF) on Chinese Medical Imaging Reports
## Data Specification

<center>

| Medical Image Report Type           | Training & Validation Set Size (9:1 split) | Test Set Size | Best Number of Training Epochs |
|-------------------------------------|:------------------------------------------:|:-------------:|:------------------------------:|
| Endocrine Adrenal CT                |                 1240                |        110        |      100     |
| Hepatobiliary Upper Abdomen CT      |                 833                 |         88        |      16      |
| Pancreatic Preoperative Staging CTA |                 1228                |        115        |      50      |

</center>

## Repository Specification

- **model-endocrine-100**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 100 epochs on the Endocrine Adrenal CT dataset.

- **model-hepatobiliary-16**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 16 epochs on the Hepatobiliary Upper Abdomen CT dataset.

- **model-pancreatic-50**: [bloom-1b1-zh](https://huggingface.co/ckip-joint/bloom-1b1-zh) model, fine-tuned for 50 epochs on the Pancreatic Preoperative Staging CTA dataset.

- **samples**: Sample files in csv format. Each model has one sample input file and one sample output file. The input file has an `INSTRUCTION` column and a `RESPONSE` column, which hold 2 medical image reports randomly selected from the test dataset and the corresponding physicians' clinical advice, respectively. The output file is identical to the input file except for an additional `GENERATED` column, which holds the clinical advice generated by the model. Users may refer to these sample files when running inference with `chatbot.py`. Details are given under Getting Started.

- **chatbot.py**: A CLI (command-line interface) for inference, modified from [chatbot.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/inference/chatbot.py). Unlike the original [chatbot.py](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/), <u>we disabled context preservation</u>, on the assumption that each input report is complete and independent of any previous context. In effect, a `clear` command is executed automatically after every inference. In addition, <u>we replaced the default prefix "Human: " and suffix "Assistant: " with "根据下面一段影像描述：" ("According to the following medical image report: ") and "生成一份对应的诊断意见：" ("Generate a piece of clinical advice: ")</u>, to better suit Chinese usage.
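The prefix and suffix described above can be illustrated with a minimal sketch. The helper names `build_prompt` and `build_prompts` are hypothetical and are not taken from `chatbot.py`; the newline-plus-space separator before the suffix is an assumption inferred from the example sessions later in this README.

```python
# Prefix and suffix quoted from the chatbot.py description above.
PREFIX = "根据下面一段影像描述："
SUFFIX = "生成一份对应的诊断意见："


def build_prompt(report: str) -> str:
    """Wrap a single report in the prefix and suffix (hypothetical helper)."""
    # Assumption: a newline and one space separate the report from the suffix,
    # which is how the prompt appears in the example sessions of this README.
    return f"{PREFIX}{report.strip()}\n {SUFFIX}"


def build_prompts(reports):
    """Build one independent prompt per report.

    Every prompt starts from scratch, so no earlier report or answer is
    carried over. This mirrors the disabled context preservation.
    """
    return [build_prompt(report) for report in reports]
```

Because each prompt is built independently, the order in which reports are entered should not affect the text sent to the model.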

## Prerequisites (Ours)

- python 3.9.13
- pandas 2.0.1
- transformers 4.28.1
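
To check whether a local environment matches the versions listed above, something like the following sketch may help. It uses only the standard library; `find_mismatches` is a hypothetical helper, and other versions may well work even though they were not the ones used here.

```python
from importlib import metadata

# Versions listed under "Prerequisites (Ours)".
EXPECTED = {"pandas": "2.0.1", "transformers": "4.28.1"}


def find_mismatches(expected, lookup=metadata.version):
    """Return {package: (wanted, found)} for every package that differs.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```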


## Getting Started

### Installation

```bash
pip install pandas==2.0.1 transformers==4.28.1
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/junwen-liu/IMIT-MedImg-LLMLF
cd IMIT-MedImg-LLMLF
```

### Basic Usage

```bash
python chatbot.py \
--path [directory containing trained model] \
--max_new_tokens [maximum new tokens to generate per response] \
--in_csv [path to the input csv file] \
--out_csv [path to the output csv file]
```
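
For readers who want to see how these four flags could fit together, here is a minimal `argparse` sketch. It is not the parser from `chatbot.py`: the default of 512 for `--max_new_tokens` is a placeholder taken from the examples in this README, and the rule that batch mode is selected when `--in_csv` is present is an assumption based on the two examples below.

```python
import argparse


def parse_args(argv=None):
    """Parse the four flags shown under Basic Usage (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the chatbot.py flags")
    parser.add_argument("--path", required=True,
                        help="directory containing the trained model")
    # Placeholder default: the real default is not stated in this README.
    parser.add_argument("--max_new_tokens", type=int, default=512,
                        help="maximum new tokens to generate per response")
    parser.add_argument("--in_csv", default=None,
                        help="path to the input csv file (batch mode)")
    parser.add_argument("--out_csv", default=None,
                        help="path to the output csv file (batch mode)")
    return parser.parse_args(argv)


def select_mode(args):
    """Assumption: giving --in_csv switches from interactive to batch mode."""
    return "batch" if args.in_csv else "interactive"
```

With this sketch, `select_mode(parse_args(["--path", "model-endocrine-100/"]))` yields `"interactive"`, while adding `--in_csv` and `--out_csv` yields `"batch"`.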

### Example: Interactive Inferencing on Endocrine Adrenal CT Test Dataset

`chatbot.py` supports interactive inference: users manually enter medical image reports one at a time and receive clinical advice in a conversational style.

- **Step 1**: Get the two Endocrine Adrenal CT reports from the `INSTRUCTION` column of `samples/sample-input-endocrine.csv`, as listed below:

```
# report 1
左侧肾上腺体部见大小约为17.8mm×15.7mm结节；密度不均匀；其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
# report 2
左侧肾上腺稍增粗；右侧肾上腺大小、形态、密度未见明显异常；双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常；均匀强化；未见异常密度影；未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤；肝右叶小囊肿。
```

- **Step 2**: Launch CLI:

```bash
python chatbot.py \
--path model-endocrine-100/ \
--max_new_tokens 512
```

- **Step 3**: Enter the first report and wait for the generated clinical advice.

- **Step 4**: Enter the second report and wait for the generated clinical advice.

- **Step 5**: Enter `quit` to exit.

The full terminal session should look something like this:

```bash
$ python chatbot.py \
> --path model-endocrine-100/ \
> --max_new_tokens 512
Enter input (type 'quit' to exit): 左侧肾上腺体部见大小约为17.8mm×15.7mm结节；密度不均匀；其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
----------------------------------------
根据下面一段影像描述：左侧肾上腺体部见大小约为17.8mm×15.7mm结节；密度不均匀；其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
 生成一份对应的诊断意见：
左侧肾上腺体部结节；右肾低密度灶；拟复杂囊肿。附见脂肪肝。请结合临床及其他相关检查；随诊。

Enter input (type 'quit' to exit): 左侧肾上腺稍增粗；右侧肾上腺大小、形态、密度未见明显异常；双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常；均匀强化；未见异常密度影；未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤；肝右叶小囊肿。
----------------------------------------
根据下面一段影像描述：左侧肾上腺稍增粗；右侧肾上腺大小、形态、密度未见明显异常；双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常；均匀强化；未见异常密度影；未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤；肝右叶小囊肿。
 生成一份对应的诊断意见：
左侧肾上腺稍增粗。附见肝脏多发血管瘤；肝右叶小囊肿。请结合临床病史及其他相关检查；随访。

Enter input (type 'quit' to exit): quit
$
```

Users may compare the generated clinical advice (the line after " 生成一份对应的诊断意见：") with the sample data in the `GENERATED` column of `samples/sample-output-endocrine.csv`.
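
When the session output is captured as text, the advice can be separated from the echoed prompt by splitting on the suffix. The sketch below assumes the output has the shape shown in the session above (prompt, suffix, then the advice); `extract_advice` is a hypothetical helper, not part of `chatbot.py`.

```python
SUFFIX = "生成一份对应的诊断意见："


def extract_advice(generated_text: str) -> str:
    """Return the text that follows the first occurrence of the suffix."""
    _, found, tail = generated_text.partition(SUFFIX)
    if not found:
        raise ValueError("suffix not found in generated text")
    # Drop the newline that separates the suffix from the advice.
    return tail.strip()
```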

### Example: Batch Inferencing on Endocrine Adrenal CT Test Dataset

`chatbot.py` also supports batch inference: users supply a csv file containing all medical image reports and receive all clinical advice at once in a generated csv file.

- **Step 1**: Prepare a csv file with an `INSTRUCTION` column that contains all medical image reports. Here we use `samples/sample-input-endocrine.csv` as input, which has the following structure:

```
INSTRUCTION,RESPONSE
[report 1],[physician advice 1]
[report 2],[physician advice 2]
...
[report n],[physician advice n]
```

- **Step 2**: Launch CLI:

```bash
python chatbot.py \
--path model-endocrine-100/ \
--max_new_tokens 512 \
--in_csv samples/sample-input-endocrine.csv \
--out_csv samples/generated-output-endocrine.csv
```

The full terminal session should look something like this:

```bash
$ python chatbot.py \
> --path model-endocrine-100/ \
> --max_new_tokens 512 \
> --in_csv samples/sample-input-endocrine.csv \
> --out_csv samples/generated-output-endocrine.csv
-------------------- Instruction 1 --------------------
根据下面一段影像描述：左侧肾上腺体部见大小约为17.8mm×15.7mm结节；密度不均匀；其内可见斑片状脂肪密度及结节状钙化灶。右侧肾上腺未见明显异常。右肾实质见低密度灶。左肾实质未见异常密度影。双肾轮廓不光整。腹膜后见淋巴结显示。附见肝实质密度减低。
 生成一份对应的诊断意见：
左侧肾上腺体部结节；右肾低密度灶；拟复杂囊肿。附见脂肪肝。请结合临床及其他相关检查；随诊。

-------------------- Instruction 2 --------------------
根据下面一段影像描述：左侧肾上腺稍增粗；右侧肾上腺大小、形态、密度未见明显异常；双侧肾上腺未见明显异常强化区。所见双肾形态大小位置正常；均匀强化；未见异常密度影；未见阳性结石影。肾盂肾盏及近端输尿管未见扩张积水及破坏。腹膜后未见异常增大的淋巴结影。附见肝脏多发血管瘤；肝右叶小囊肿。
 生成一份对应的诊断意见：
左侧肾上腺稍增粗。附见肝脏多发血管瘤；肝右叶小囊肿。请结合临床病史及其他相关检查；随访。

$
```

In addition, a csv file with the following structure will be generated at `samples/generated-output-endocrine.csv`:

```
INSTRUCTION,RESPONSE,GENERATED
[report 1],[physician advice 1],[generated advice 1]
[report 2],[physician advice 2],[generated advice 2]
...
[report n],[physician advice n],[generated advice n]
```
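
The batch flow in this section (read the `INSTRUCTION` column, generate advice per row, write a `GENERATED` column) can be sketched as follows. This is not the code in `chatbot.py`: it uses the standard `csv` module instead of pandas to stay dependency-free, `run_batch` and the `generate` callable are hypothetical, and UTF-8 encoding of the csv files is an assumption.

```python
import csv


def run_batch(in_csv, out_csv, generate):
    """Copy in_csv to out_csv, adding a GENERATED column.

    `generate` is any callable that maps one report (str) to advice (str),
    for example a wrapper around a fine-tuned model.
    Returns the number of rows processed.
    """
    # Assumption: the csv files are UTF-8 encoded.
    with open(in_csv, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    if "INSTRUCTION" not in fieldnames:
        raise ValueError("input csv needs an INSTRUCTION column")
    if "GENERATED" not in fieldnames:
        fieldnames.append("GENERATED")

    # Each report is processed on its own; nothing is shared between rows.
    for row in rows:
        row["GENERATED"] = generate(row["INSTRUCTION"])

    with open(out_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Passing the generator in as a callable keeps the csv handling testable without loading a model.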

Users may compare `samples/generated-output-endocrine.csv` with the sample file `samples/sample-output-endocrine.csv`.
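
One crude way to make this comparison is an exact-match rate over the `GENERATED` column, sketched below. The helper names are hypothetical, UTF-8 encoding is assumed, and exact string equality is only a rough check: outputs that differ in wording may still be clinically equivalent, and generation settings may affect reproducibility.

```python
import csv


def load_column(path, column):
    """Read one column of a csv file into a list of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f)]


def exact_match_rate(generated_csv, sample_csv, column="GENERATED"):
    """Fraction of rows whose `column` values are identical in both files."""
    ours = load_column(generated_csv, column)
    theirs = load_column(sample_csv, column)
    if len(ours) != len(theirs):
        raise ValueError("the two files have different numbers of rows")
    if not ours:
        return 0.0
    # Ignore leading and trailing whitespace when comparing.
    matches = sum(a.strip() == b.strip() for a, b in zip(ours, theirs))
    return matches / len(ours)
```

For example, `exact_match_rate("samples/generated-output-endocrine.csv", "samples/sample-output-endocrine.csv")` would compare the two files named above.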