# Experiments
Below are instructions for running experiments on our novel ChestAgentBench and on the previous SoTA benchmark, CheXbench. ChestAgentBench is a comprehensive benchmark containing over 2,500 complex medical queries across 8 diverse categories.
### ChestAgentBench
To run GPT-4o on ChestAgentBench, enter the `experiments` directory and run the following script:
```bash
python benchmark_gpt4o.py
```
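The script talks to the OpenAI API, so you'll need credentials in place first. A minimal setup sketch, assuming `benchmark_gpt4o.py` reads the standard `OPENAI_API_KEY` environment variable:
```bash
# Assumption: benchmark_gpt4o.py authenticates via the standard OpenAI env var.
export OPENAI_API_KEY="sk-..."
```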
To run Llama 3.2 Vision 90B on ChestAgentBench, run the following:
```bash
python benchmark_llama.py
```
To run CheXagent on ChestAgentBench, run the following:
```bash
python benchmark_chexagent.py
```
To run LLaVA-Med on ChestAgentBench, first clone the LLaVA-Med repo and follow its setup instructions, then copy the benchmark script into it and run it:
```bash
mv benchmark_llavamed.py ~/LLaVA-Med/llava/serve
python -m llava.serve.benchmark_llavamed --model-name llava-med-v1.5-mistral-7b --controller http://localhost:10000
```
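Note that the `--controller` flag expects a LLaVA controller already listening on port 10000. A sketch of that serving setup, based on the standard LLaVA serving commands (the exact flags, the worker port, and the model path are assumptions; defer to the LLaVA-Med repo's instructions):
```bash
# Assumption: standard LLaVA-style serving setup; check the LLaVA-Med repo for exact flags.
python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 \
    --model-path microsoft/llava-med-v1.5-mistral-7b
```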
If you want to inspect the logs, you can run the following. It will select the most recent log file by default.
```bash
python inspect_logs.py [optional: log-file] -n [num-logs]
```
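For example, to print the five most recent entries of a specific run (the log filename here is hypothetical):
```bash
# Hypothetical log file; -n caps the number of logs shown.
python inspect_logs.py results/gpt4o_run.json -n 5
```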
Finally, to analyze results, run:
```bash
python analyze_axes.py results/[logfile].json ../benchmark/questions/ --model [gpt4|llama|chexagent|llava-med] --max-questions [optional:int]
```
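For example, a filled-in invocation scoring the first 100 questions of a GPT-4o run (the log filename is hypothetical):
```bash
# Hypothetical log file name; --max-questions is optional and caps the question count.
python analyze_axes.py results/gpt4o_run.json ../benchmark/questions/ --model gpt4 --max-questions 100
```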
### CheXbench
To run the models on CheXbench, you can use `chexbench_gpt4.py` as a reference. You'll need to download the dataset files locally and upload them with each request. Rad-ReStruct and Open-I share the same set of images, so you can download `NLMCXR.zip` once and copy the images into both directories (see the sketch after the list below).
You can find the datasets here:
1. [SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering](https://www.med-vqa.com/slake/). Save this to `MedMAX/data/slake`.
2. [Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting](https://github.com/ChantalMP/Rad-ReStruct). Save the images to `MedMAX/data/rad-restruct/images`.
3. [Open-I Service of the National Library of Medicine](https://openi.nlm.nih.gov/faq). Save the images to `MedMAX/data/openi/images`.
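A minimal sketch of the shared-image setup, assuming `NLMCXR.zip` unpacks to a flat directory of images (the actual archive layout may differ):
```bash
# Unpack the shared NLMCXR images once, then copy them into both dataset directories.
unzip NLMCXR.zip -d nlmcxr_images
mkdir -p MedMAX/data/rad-restruct/images MedMAX/data/openi/images
cp -r nlmcxr_images/. MedMAX/data/rad-restruct/images/
cp -r nlmcxr_images/. MedMAX/data/openi/images/
```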
Once you're finished, fix the paths in `chexbench.json` to point at your local copies using the `MedMax/data/fix_chexbench.py` script.
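Assuming the script takes no arguments and rewrites `chexbench.json` in place (check the script itself for its actual interface):
```bash
# Assumption: no arguments; edits chexbench.json in place.
python MedMax/data/fix_chexbench.py
```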
### Compare Runs
To analyze a single results file for overall accuracy and for accuracy along the different axes:
```bash
python compare_runs.py results/medmax.json
```
For a direct evaluation comparing **2** models on the exact same questions:
```bash
python compare_runs.py results/medmax.json results/gpt4o.json
```
For a direct evaluation comparing **ALL** models on the exact same questions (add as many model log files as you want):
```bash
python compare_runs.py results/medmax.json results/gpt4o.json results/llama.json results/chexagent.json results/llavamed.json
```