Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -62,9 +62,9 @@ Additional information about the datasets will be included in the Meditron-3 pub

  | Model Name | MedmcQA | MedQA | PubmedQA | Average |
  |-----------------------------|---------|--------|----------|---------|
- | google/gemma-2-2b | 40.31 | 34.80 | 74.20 | 49.77 |
- | gemMeditron-2-2b-it | 42.51 | 38.81 | 75.40 | 52.24 |
- | Difference (gemMeditron vs.)| 2.20 | 4.01 | 1.20 | 2.47 |
+ | google/gemma-2-2b-it | 42.89 | 44.62 | 74.00 | 53.84 |
+ | gemMeditron-2-2b-it | 46.57 | 43.21 | 74.40 | 54.69 |
+ | Difference (gemMeditron vs.)| 3.58 | -1.41 | 0.40 | 0.85 |

  We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.
  While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice and we need to go beyond accuracy to assess finer-grained issues like empathy, alignment to local guidelines, structure, completeness and safety. To address this, we have developed a platform to collect feedback directly from experts to continuously adapt to the changing contexts of clinical practice.
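
The Average and Difference entries in the table can be recomputed from the per-benchmark scores. The following is a minimal sketch, assuming that Average is the unweighted mean of the three benchmarks and that Difference is the fine-tuned score minus the base score; neither definition is stated in the README. The names `SCORES`, `average`, and `difference` are hypothetical helpers, not part of any Meditron tooling.

```python
# Per-benchmark accuracies copied from the updated table rows
# (decimal commas in the source are read as decimal points).
SCORES = {
    "google/gemma-2-2b-it": {"MedmcQA": 42.89, "MedQA": 44.62, "PubmedQA": 74.00},
    "gemMeditron-2-2b-it": {"MedmcQA": 46.57, "MedQA": 43.21, "PubmedQA": 74.40},
}


def average(scores):
    """Unweighted mean over benchmarks (assumed definition of the Average column)."""
    return sum(scores.values()) / len(scores)


def difference(tuned, base):
    """Per-benchmark gain of `tuned` over `base`, plus the gain in the average."""
    diff = {name: tuned[name] - base[name] for name in base}
    diff["Average"] = average(tuned) - average(base)
    return diff


base = SCORES["google/gemma-2-2b-it"]
tuned = SCORES["gemMeditron-2-2b-it"]

# Print recomputed values so they can be compared against the table by eye.
for model, scores in SCORES.items():
    print(f"{model}: average = {average(scores):.2f}")
for name, delta in difference(tuned, base).items():
    print(f"Difference on {name}: {delta:+.2f}")
```

Running a check like this whenever the table is edited helps keep the derived columns in step with the raw scores.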