## Experiments
Below are instructions for running experiments on our novel ChestAgentBench and on the previous SoTA benchmark, CheXbench. ChestAgentBench is a comprehensive benchmark containing over 2,500 complex medical queries across 8 diverse categories.
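For orientation, each entry under `benchmark/questions/` is a JSON question file. Below is a minimal sketch of loading one; the field names (`question`, `options`, `answer`) are assumptions for illustration, not the actual ChestAgentBench schema:

```python
import json
from pathlib import Path

# Load one question file; adjust the path to where the repo is checked out.
question_file = next(Path("benchmark/questions").glob("*.json"))
with open(question_file) as f:
    question = json.load(f)

# Field names below are assumptions, not the actual schema.
print(question["question"])  # the medical query text
print(question["options"])   # candidate answers
print(question["answer"])    # ground-truth label
```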
### ChestAgentBench
To run GPT-4o on ChestAgentBench, enter the `experiments` directory and run the following script:

```
python benchmark_gpt4o.py
```
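If you want to see the shape of the API calls such a script makes, here is a minimal, self-contained sketch of scoring one image/question pair with GPT-4o via the OpenAI SDK. This is not the script's actual code; the function name, prompt, and paths are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path: str, question: str) -> str:
    # Encode the chest X-ray as a base64 data URL for the vision API.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```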
To run Llama 3.2 Vision 90B on ChestAgentBench, run the following:

```
python benchmark_llama.py
```
To run CheXagent on ChestAgentBench, run the following:

```
python benchmark_chexagent.py
```
To run LLaVA-Med on ChestAgentBench, first follow their setup instructions and clone their repo, then copy the benchmark script into it and run it:

```
mv benchmark_llavamed.py ~/LLaVA-Med/llava/serve
python -m llava.serve.benchmark_llavamed --model-name llava-med-v1.5-mistral-7b --controller http://localhost:10000
```
If you want to inspect the logs, you can run the following. It selects the most recent log file by default:

```
python inspect_logs.py [optional: log-file] -n [num-logs]
```
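The default "most recent" behavior presumably amounts to picking the newest file by modification time; a one-line equivalent, assuming the logs are the JSON files under `results/`:

```python
from pathlib import Path

# Assumption: log files are *.json under results/; pick the newest by mtime.
latest = max(Path("results").glob("*.json"), key=lambda p: p.stat().st_mtime)
print(latest)
```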
Finally, to analyze results, run:

```
python analyze_axes.py results/[logfile].json ../benchmark/questions/ --model [gpt4|llama|chexagent|llava-med] --max-questions [optional:int]
```
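For example, to analyze a GPT-4o run (the log filename here is illustrative):

```
python analyze_axes.py results/gpt4o.json ../benchmark/questions/ --model gpt4
```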
### CheXbench
To run the models on CheXbench, you can use `chexbench_gpt4.py` as a reference. You'll need to download the dataset files locally and upload them with each request. Rad-ReStruct and Open-I use the same set of images, so you can download the `NLMCXR.zip` file just once and copy the images to both directories.
You can find the datasets here:

- SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering. Save this to `MedMAX/data/slake`.
- Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting. Save the images to `MedMAX/data/rad-restruct/images`.
- Open-I Service of the National Library of Medicine. Save the images to `MedMAX/data/openi/images`.
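Since Rad-ReStruct and Open-I share the NLMCXR images, one extraction suffices; here is a sketch of fanning the images out to both directories above (the temp directory name is arbitrary, and the file extension may need adjusting to match the archive contents):

```python
import shutil
import zipfile
from pathlib import Path

# Extract the shared archive once, then copy into both image directories.
zipfile.ZipFile("NLMCXR.zip").extractall("nlmcxr_tmp")
for target in ("MedMAX/data/rad-restruct/images", "MedMAX/data/openi/images"):
    Path(target).mkdir(parents=True, exist_ok=True)
    # Assumption: images are PNGs; change the pattern if the archive differs.
    for img in Path("nlmcxr_tmp").rglob("*.png"):
        shutil.copy(img, target)
```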
Once you're finished, you'll want to fix the paths in the `chexbench.json` file to your local paths using the `MedMax/data/fix_chexbench.py` script.
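The script rewrites the image paths stored in `chexbench.json` to your local layout. As a rough illustration of that kind of rewrite (not the script's actual code; the prefixes and the `image_path` field are assumptions about the file's structure):

```python
import json

OLD_PREFIX = "/original/path/to/data"  # assumption: prefix shipped in the file
NEW_PREFIX = "MedMAX/data"             # your local data root

with open("chexbench.json") as f:
    bench = json.load(f)

# Assumption: chexbench.json is a list of records with an "image_path" key.
for entry in bench:
    entry["image_path"] = entry["image_path"].replace(OLD_PREFIX, NEW_PREFIX)

with open("chexbench.json", "w") as f:
    json.dump(bench, f, indent=2)
```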
### Compare Runs
Analyze a single file based on overall accuracy and along different axes:

```
python compare_runs.py results/medmax.json
```
For a direct evaluation comparing 2 models on the exact same questions:

```
python compare_runs.py results/medmax.json results/gpt4o.json
```
For a direct evaluation comparing all models on the exact same questions (add as many model log files as you want):

```
python compare_runs.py results/medmax.json results/gpt4o.json results/llama.json results/chexagent.json results/llavamed.json
```
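Under the hood, a fair multi-model comparison has to restrict scoring to the intersection of questions present in every log. A sketch of that idea, assuming each results file maps question IDs to a record with a `correct` flag (the real log schema may differ):

```python
import json
import sys

# Assumption: each results file is a JSON object mapping question ID
# to {"correct": bool}; the actual log format may differ.
runs = {path: json.load(open(path)) for path in sys.argv[1:]}
shared = set.intersection(*(set(r) for r in runs.values()))

# Score every model only on the questions all models answered.
for path, results in runs.items():
    acc = sum(results[q]["correct"] for q in shared) / len(shared)
    print(f"{path}: {acc:.1%} on {len(shared)} shared questions")
```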