# Experiments
Below are instructions for running experiments using our novel ChestAgentBench and the prior standard benchmark, CheXbench. ChestAgentBench is a comprehensive benchmark containing over 2,500 complex medical queries across 8 diverse categories.

### ChestAgentBench

To run GPT-4o on ChestAgentBench, enter the `experiments` directory and run the following script:
```bash
python benchmark_gpt4o.py
```
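
The GPT-4o benchmark calls the OpenAI API, so you'll need credentials. A minimal sketch, assuming the script picks up the standard `OPENAI_API_KEY` environment variable (check the script if it loads credentials differently):
```bash
# Assumption: benchmark_gpt4o.py reads the standard OpenAI environment variable.
export OPENAI_API_KEY="sk-..."   # your OpenAI API key
python benchmark_gpt4o.py
```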

To run Llama 3.2 Vision 90B on ChestAgentBench, run the following:
```bash
python benchmark_llama.py
```

To run CheXagent on ChestAgentBench, run the following:
```bash
python benchmark_chexagent.py
```

To run LLaVA-Med on ChestAgentBench, first clone their repo and follow their setup instructions, then move the benchmark script into it and launch it:
```bash
mv benchmark_llavamed.py ~/LLaVA-Med/llava/serve
python -m llava.serve.benchmark_llavamed --model-name llava-med-v1.5-mistral-7b --controller http://localhost:10000
```
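
The `--controller` flag assumes LLaVA-Med's controller and a model worker are already serving the model locally. A sketch following the LLaVA serving pattern (double-check the exact flags and model path against the LLaVA-Med README):
```bash
# Terminal 1: start the controller on the port the benchmark expects.
python -m llava.serve.controller --host 0.0.0.0 --port 10000

# Terminal 2: start a model worker and register it with the controller.
# The model path assumes the Hugging Face release of LLaVA-Med v1.5.
python -m llava.serve.model_worker --host 0.0.0.0 --port 40000 \
    --controller http://localhost:10000 --worker http://localhost:40000 \
    --model-path microsoft/llava-med-v1.5-mistral-7b
```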

To inspect the logs, run the following; it selects the most recent log file by default:
```bash
python inspect_logs.py [optional: log-file] -n [num-logs]
```
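
For example, to show the five most recent entries of a specific log (the file name here is illustrative):
```bash
# Hypothetical log file; substitute a real file from your results directory.
python inspect_logs.py results/gpt4o_run.json -n 5
```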

Finally, to analyze results, run:
```bash
python analyze_axes.py results/[logfile].json ../benchmark/questions/ --model [gpt4|llama|chexagent|llava-med] --max-questions [optional:int]
```
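
For example, to score a GPT-4o run on the first 100 questions (the results file name is illustrative):
```bash
# Hypothetical results file; --max-questions is optional and caps how many questions are scored.
python analyze_axes.py results/gpt4o_run.json ../benchmark/questions/ --model gpt4 --max-questions 100
```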

### CheXbench

To run the models on CheXbench, you can use `chexbench_gpt4.py` as a reference. You'll need to download the dataset files locally and upload the images with each request. Rad-ReStruct and Open-I use the same set of images, so you can download the `NLMCXR.zip` file just once and copy the images into both directories (see the sketch after the dataset list below).

You can find the datasets here:
1. [SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering](https://www.med-vqa.com/slake/). Save this to `MedMAX/data/slake`.
2. [Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting](https://github.com/ChantalMP/Rad-ReStruct). Save the images to `MedMAX/data/rad-restruct/images`.
3. [Open-I Service of the National Library of Medicine](https://openi.nlm.nih.gov/faq). Save the images to `MedMAX/data/openi/images`.
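
Since Rad-ReStruct and Open-I share images, a single download suffices. A minimal sketch, assuming `NLMCXR.zip` has already been downloaded from Open-I and unpacks into a flat directory of images:
```bash
# Unpack the shared Open-I image archive once.
unzip NLMCXR.zip -d NLMCXR

# Copy the same images into both dataset directories.
mkdir -p MedMAX/data/rad-restruct/images MedMAX/data/openi/images
cp -r NLMCXR/. MedMAX/data/rad-restruct/images/
cp -r NLMCXR/. MedMAX/data/openi/images/
```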

Once you're finished, fix the paths in the `chexbench.json` file to point at your local copies using the `MedMAX/data/fix_chexbench.py` script.
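
A sketch of the invocation, assuming the script takes no arguments and rewrites `chexbench.json` in place (check the script for the paths it expects):
```bash
# Assumption: fix_chexbench.py runs without arguments and edits chexbench.json in place.
python MedMAX/data/fix_chexbench.py
```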


### Compare Runs
To analyze a single results file for overall accuracy and along the different question axes:
```bash
python compare_runs.py results/medmax.json
```

For a direct evaluation comparing **2** models on the exact same questions:
```bash
python compare_runs.py results/medmax.json results/gpt4o.json
```

For a direct evaluation comparing **ALL** models on the exact same questions (add as many model log files as you want):
```bash
python compare_runs.py results/medmax.json results/gpt4o.json results/llama.json results/chexagent.json results/llavamed.json
```