## Experiments
Below are instructions for running experiments on our novel ChestAgentBench and on the previous SoTA benchmark, CheXbench. ChestAgentBench is a comprehensive benchmark containing over 2,500 complex medical queries across 8 diverse categories.
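For orientation, each entry under `benchmark/questions/` is a JSON question file. Below is a minimal sketch of loading one; the field names (`question`, `options`, `answer`) are assumptions for illustration, not the actual ChestAgentBench schema:

```python
import json
from pathlib import Path

# Load one question file; adjust the path to where the repo is checked out.
question_file = next(Path("benchmark/questions").glob("*.json"))
with open(question_file) as f:
    question = json.load(f)

# Field names below are assumptions, not the actual schema.
print(question["question"])  # the medical query text
print(question["options"])   # candidate answers
print(question["answer"])    # ground-truth label
```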
### ChestAgentBench
To run GPT-4o on ChestAgentBench, enter the `experiments` directory and run the following script:

```
python benchmark_gpt4o.py
```
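If you want to see the shape of the API calls such a script makes, here is a minimal, self-contained sketch of scoring one image/question pair with GPT-4o via the OpenAI SDK. This is not the script's actual code; the function name, prompt, and paths are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path: str, question: str) -> str:
    # Encode the chest X-ray as a base64 data URL for the vision API.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```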
To run Llama 3.2 Vision 90B on ChestAgentBench, run the following:

```
python benchmark_llama.py
```
To run CheXagent on ChestAgentBench, run the following:

```
python benchmark_chexagent.py
```
To run LLaVA-Med on ChestAgentBench, first follow their setup instructions and clone their repo, then copy the benchmark script into it and run it:

```
mv benchmark_llavamed.py ~/LLaVA-Med/llava/serve
python -m llava.serve.benchmark_llavamed --model-name llava-med-v1.5-mistral-7b --controller http://localhost:10000
```
If you want to inspect the logs, you can run the following. It selects the most recent log file by default:

```
python inspect_logs.py [optional: log-file] -n [num-logs]
```
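The default "most recent" behavior presumably amounts to picking the newest file by modification time; a one-line equivalent, assuming the logs are the JSON files under `results/`:

```python
from pathlib import Path

# Assumption: log files are *.json under results/; pick the newest by mtime.
latest = max(Path("results").glob("*.json"), key=lambda p: p.stat().st_mtime)
print(latest)
```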
Finally, to analyze results, run:

```
python analyze_axes.py results/[logfile].json ../benchmark/questions/ --model [gpt4|llama|chexagent|llava-med] --max-questions [optional:int]
```
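For example, to analyze a GPT-4o run (the log filename here is illustrative):

```
python analyze_axes.py results/gpt4o.json ../benchmark/questions/ --model gpt4
```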
### CheXbench
To run the models on CheXbench, you can use `chexbench_gpt4.py` as a reference. You'll need to download the dataset files locally and upload them with each request. Rad-ReStruct and Open-I use the same set of images, so you can download the `NLMCXR.zip` file just once and copy the images to both directories.
You can find the datasets here:

- SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering. Save this to `MedMAX/data/slake`.
- Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting. Save the images to `MedMAX/data/rad-restruct/images`.
- Open-I Service of the National Library of Medicine. Save the images to `MedMAX/data/openi/images`.
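Since Rad-ReStruct and Open-I share the NLMCXR images, one extraction suffices; here is a sketch of fanning the images out to both directories above (the temp directory name is arbitrary, and the file extension may need adjusting to match the archive contents):

```python
import shutil
import zipfile
from pathlib import Path

# Extract the shared archive once, then copy into both image directories.
zipfile.ZipFile("NLMCXR.zip").extractall("nlmcxr_tmp")
for target in ("MedMAX/data/rad-restruct/images", "MedMAX/data/openi/images"):
    Path(target).mkdir(parents=True, exist_ok=True)
    # Assumption: images are PNGs; change the pattern if the archive differs.
    for img in Path("nlmcxr_tmp").rglob("*.png"):
        shutil.copy(img, target)
```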
Once you're finished, you'll want to fix the paths in the `chexbench.json` file to your local paths using the `MedMax/data/fix_chexbench.py` script.
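The script rewrites the image paths stored in `chexbench.json` to your local layout. As a rough illustration of that kind of rewrite (not the script's actual code; the prefixes and the `image_path` field are assumptions about the file's structure):

```python
import json

OLD_PREFIX = "/original/path/to/data"  # assumption: prefix shipped in the file
NEW_PREFIX = "MedMAX/data"             # your local data root

with open("chexbench.json") as f:
    bench = json.load(f)

# Assumption: chexbench.json is a list of records with an "image_path" key.
for entry in bench:
    entry["image_path"] = entry["image_path"].replace(OLD_PREFIX, NEW_PREFIX)

with open("chexbench.json", "w") as f:
    json.dump(bench, f, indent=2)
```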
### Compare Runs
Analyze a single file based on overall accuracy and along different axes:

```
python compare_runs.py results/medmax.json
```
For a direct evaluation comparing 2 models on the exact same questions:

```
python compare_runs.py results/medmax.json results/gpt4o.json
```
For a direct evaluation comparing all models on the exact same questions (add as many model log files as you want):

```
python compare_runs.py results/medmax.json results/gpt4o.json results/llama.json results/chexagent.json results/llavamed.json
```
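Under the hood, a fair multi-model comparison has to restrict scoring to the intersection of questions present in every log. A sketch of that idea, assuming each results file maps question IDs to a record with a `correct` flag (the real log schema may differ):

```python
import json
import sys

# Assumption: each results file is a JSON object mapping question ID
# to {"correct": bool}; the actual log format may differ.
runs = {path: json.load(open(path)) for path in sys.argv[1:]}
shared = set.intersection(*(set(r) for r in runs.values()))

# Score every model only on the questions all models answered.
for path, results in runs.items():
    acc = sum(results[q]["correct"] for q in shared) / len(shared)
    print(f"{path}: {acc:.1%} on {len(shared)} shared questions")
```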