cchristophe committed on
Commit
0b9cc1b
·
verified ·
1 Parent(s): 89bd376

Update README.md

Files changed (1)
  1. README.md +38 -36
README.md CHANGED
@@ -22,14 +22,14 @@ Med42-v2 is a suite of open-access clinical large language models (LLM) instruct
22
 
23
  |Models|Elo Score|
24
  |:---:|:---:|
25
- |Med42-v2-70B| --- |
26
- |Llama3-70B-Instruct| --- |
27
- |GPT4-o| --- |
28
- |Med42-v2-8B| --- |
29
- |Llama3-8B-Instruct| --- |
30
- |Mixtral-8x7b-Instruct| --- |
31
- |OpenBioLLM-70B| --- |
32
- |JSL-MedLlama-3-8B-v2.0| --- |
33
 
34
 
35
  ## Limitations & Safe Use
@@ -129,26 +129,6 @@ The training was conducted on the NVIDIA DGX cluster with H100 GPUs, utilizing P
129
 
130
  ## Evaluation Results
131
 
132
- ### MCQA Evaluation
133
-
134
- Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics, and the MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (unless otherwise stated). We integrated chat templates into the harness and computed the likelihood of the full answer instead of only the tokens "a.", "b.", "c." or "d.".
135
-
136
- |Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
137
- |---:|:---:|:---:|:---:|:---:|:---:|
138
- |Med42v2-70B|64.36|87.12|73.20|79.10|83.80|
139
- |Med42v2-8B|54.30|75.76|61.34|62.84|67.04|
140
- |OpenBioLLM|64.24|90.40|73.18|76.90|79.01|
141
- |GPT-4.0<sup>&dagger;</sup>|-|87.00|69.50|78.90|84.05|
142
- |MedGemini*|-|-|-|84.00|-|
143
- |Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-|
144
- |Med42|-|76.72|60.90|61.50|71.85|
145
- |ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
146
- |GPT-3.5<sup>&dagger;</sup>|-|66.63|50.10|50.80|53.00|
147
-
148
- **For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.
149
-
150
- <sup>&dagger;</sup> *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*.
151
-
152
  ### Open-ended question generation
153
 
154
  To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 carefully curated, publicly accessible healthcare-related questions, with responses generated by various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
@@ -170,19 +150,41 @@ Which response is of higher overall quality in a medical context? Consider:
170
  #### Elo Ratings
171
  |Models|Elo Score|
172
  |:---:|:---:|
173
- |Med42-v2-70B| --- |
174
- |Llama3-70B-Instruct| --- |
175
- |GPT4-o| --- |
176
- |Med42-v2-8B| --- |
177
- |Llama3-8B-Instruct| --- |
178
- |Mixtral-8x7b-Instruct| --- |
179
- |OpenBioLLM-70B| --- |
180
- |JSL-MedLlama-3-8B-v2.0| --- |
181
 
182
  #### Win-rate
183
 
184
  [Include Image]
185
 
186
 
187
  ## Accessing Med42 and Reporting Issues
188
 
 
22
 
23
  |Models|Elo Score|
24
  |:---:|:---:|
25
+ |Med42-v2-70B| 1764 |
26
+ |Llama3-70B-Instruct| 1643 |
27
+ |GPT4-o| 1426 |
28
+ |Llama3-8B-Instruct| 1352 |
29
+ |Mixtral-8x7b-Instruct| 970 |
30
+ |Med42-v2-8B| 924 |
31
+ |OpenBioLLM-70B| 657 |
32
+ |JSL-MedLlama-3-8B-v2.0| 447 |
33
 
34
 
35
  ## Limitations & Safe Use
 
129
 
130
  ## Evaluation Results
131
 
132
  ### Open-ended question generation
133
 
134
  To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 carefully curated, publicly accessible healthcare-related questions, with responses generated by various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
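
The step from pairwise judge verdicts to Elo ratings can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the standard online Elo update with a 400-point logistic scale, and the function names, the `k` factor, the starting rating of 1000, and the `(model_a, model_b, winner)` record layout are all hypothetical choices made for this example.

```python
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(battles, k=4.0, base=1000.0, seed=0):
    """Compute Elo ratings from pairwise verdicts.

    battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}.
    k, base and seed are illustrative defaults, not values from the README.
    """
    battles = list(battles)
    # Online Elo depends on battle order, so shuffle with a fixed seed.
    random.Random(seed).shuffle(battles)
    ratings = {}
    for a, b, winner in battles:
        r_a = ratings.setdefault(a, base)
        r_b = ratings.setdefault(b, base)
        e_a = expected_score(r_a, r_b)
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Zero-sum update: whatever A gains, B loses.
        ratings[a] = r_a + k * (s_a - e_a)
        ratings[b] = r_b - k * (s_a - e_a)
    return ratings


if __name__ == "__main__":
    toy = [("model-x", "model-y", "a")] * 30 + [("model-x", "model-y", "b")] * 10
    print(elo_ratings(toy))
```

If the reported ratings follow this same 400-point scale (an assumption), `expected_score(1764, 1643)` gives roughly 0.67, which would be the implied chance that the judge prefers Med42-v2-70B over Llama3-70B-Instruct on a given question.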
 
150
  #### Elo Ratings
151
  |Models|Elo Score|
152
  |:---:|:---:|
153
+ |Med42-v2-70B| 1764 |
154
+ |Llama3-70B-Instruct| 1643 |
155
+ |GPT4-o| 1426 |
156
+ |Llama3-8B-Instruct| 1352 |
157
+ |Mixtral-8x7b-Instruct| 970 |
158
+ |Med42-v2-8B| 924 |
159
+ |OpenBioLLM-70B| 657 |
160
+ |JSL-MedLlama-3-8B-v2.0| 447 |
161
 
162
  #### Win-rate
163
 
164
  [Include Image]
165
 
166
+ ### MCQA Evaluation
167
+
168
+ Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics, and the MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (unless otherwise stated). We integrated chat templates into the harness and computed the likelihood of the full answer instead of only the tokens "a.", "b.", "c." or "d.".
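
The idea of scoring the full answer text, rather than only the option letter, can be sketched as below. This is an illustration only and does not use the lm-evaluation-harness API: `pick_answer`, `accuracy`, and the `loglik(prompt, continuation)` callable are hypothetical names, and `toy_loglik` is a stand-in word-overlap scorer so that the example runs without a model. In a real setup, `loglik` would return the model's log-likelihood of the continuation given the chat-templated prompt.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical scorer type: log-likelihood of `continuation` given `prompt`.
LogLik = Callable[[str, str], float]


def pick_answer(question: str, options: Sequence[str], loglik: LogLik) -> int:
    """Return the index of the option whose full text scores highest."""
    scores = [loglik(question, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


def accuracy(items: Sequence[Tuple[str, Sequence[str], int]], loglik: LogLik) -> float:
    """Zero-shot accuracy over (question, options, gold_index) items."""
    correct = sum(pick_answer(q, opts, loglik) == gold for q, opts, gold in items)
    return correct / len(items)


def toy_loglik(prompt: str, continuation: str) -> float:
    """Stand-in scorer (NOT a language model): counts words shared with the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in continuation.lower().split()))


if __name__ == "__main__":
    items = [
        ("Lack of vitamin c causes which disease", ["a. vitamin a", "b. vitamin c"], 1),
    ]
    print(accuracy(items, toy_loglik))
```

Scoring the whole option string lets the answer content contribute to the decision, whereas scoring only "a.", "b.", "c." or "d." depends on how the model treats a single label token.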
169
+
170
+ |Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
171
+ |---:|:---:|:---:|:---:|:---:|:---:|
172
+ |Med42v2-70B|64.36|87.12|73.20|79.10|83.80|
173
+ |Med42v2-8B|54.30|75.76|61.34|62.84|67.04|
174
+ |OpenBioLLM|64.24|90.40|73.18|76.90|79.01|
175
+ |GPT-4.0<sup>&dagger;</sup>|-|87.00|69.50|78.90|84.05|
176
+ |MedGemini*|-|-|-|84.00|-|
177
+ |Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-|
178
+ |Med42|-|76.72|60.90|61.50|71.85|
179
+ |ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
180
+ |GPT-3.5<sup>&dagger;</sup>|-|66.63|50.10|50.80|53.00|
181
+ |Llama3-8B-Instruct|-|-|-|-|-|
182
+ |Llama3-70B-Instruct|-|-|-|-|-|
183
+
184
+ **For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42)*.
185
+
186
+ <sup>&dagger;</sup> *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*.
187
+
188
 
189
  ## Accessing Med42 and Reporting Issues
190