OpenMeditron/Meditron3-Gemma2-2B

Text Generation

Update README.md

#2

by ETraKoZ - opened 12 days ago

base: refs/heads/main ← from: refs/pr/2

README.md +3 -3

@@ -62,9 +62,9 @@ Additional information about the datasets will be included in the Meditron-3 pub
 | Model Name                  | MedmcQA | MedQA  | PubmedQA | Average |
 |-----------------------------|---------|--------|----------|---------|
-| google/gemma-2-2b           | 40.31   | 34.80  | 74.20    | 49.77   |
-| gemMeditron-2-2b-it         | 42.51   | 38.81  | 75.40    | 52.24   |
-| Difference (gemMeditron vs.)| 2.20    | 4.01   | 1.20     | 2.47    |
+| google/gemma-2-2b-it        | 42.89   | 44.62  | 74.00    | 53.84   |
+| gemMeditron-2-2b-it         | 46.57   | 43.21  | 74.40    | 54.69   |
+| Difference (gemMeditron vs.)| 3.58    | -1.41  | 0.40     | 0.85    |
 We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.
 While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice and we need to go beyond accuracy to assess finer-grained issues like empathy, alignment to local guidelines, structure, completeness and safety. To address this, we have developed a platform to collect feedback directly from experts to continuously adapt to the changing contexts of clinical practice.
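
The "Average" column and the "Difference" row in the table are derived values, so they can be recomputed from the two model rows. The following is a minimal sketch of that check, not part of the README or of lm-harness. The `summarize` helper and the `scores` dictionary are hypothetical names introduced here, and the scores are copied from the added (`+`) rows with decimal points.

```python
from statistics import mean

# Benchmark order matches the table columns.
BENCHMARKS = ("MedmcQA", "MedQA", "PubmedQA")

# Scores copied from the added (+) rows of the diff.
scores = {
    "google/gemma-2-2b-it": (42.89, 44.62, 74.00),
    "gemMeditron-2-2b-it": (46.57, 43.21, 74.40),
}


def summarize(base, tuned):
    """Return (per-benchmark differences, base average, tuned average).

    Differences are tuned minus base; everything is rounded to 2 decimals,
    the precision used in the table.
    """
    diffs = {
        name: round(t - b, 2)
        for name, b, t in zip(BENCHMARKS, base, tuned)
    }
    return diffs, round(mean(base), 2), round(mean(tuned), 2)


diffs, base_avg, tuned_avg = summarize(
    scores["google/gemma-2-2b-it"],
    scores["gemMeditron-2-2b-it"],
)
print("differences:", diffs)
print("averages:", base_avg, tuned_avg)
```

With the values as written, the recomputed MedmcQA difference is 3.68 and the gemMeditron average is 54.73, so the 3.58 and 54.69 in the added rows may deserve a second look before merging.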