Update README.md
README.md

```diff
@@ -73,7 +73,7 @@ Memphis outperforms human-data models that are over twice its size, along with S
 Note that BBH results have wide SEs, exceeding 16%.
 
 
-It is unclear why Zephyr performs so poorly on BBH. Perhaps it is overfit.
+It is unclear why Zephyr performs so poorly on BBH. Perhaps it is overfit, or there may have been an issue with `vllm`.
 
 Notes:
 - Evaluations were performed using the `agieval` branch of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (commit `0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298`), using the `vllm` model.
```
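As a rough illustration of why the wide BBH standard errors matter (the two scores below are hypothetical, not taken from the table), even a 10-point accuracy gap is far from statistically significant when each estimate carries an SE of 16%:

```python
import math

def z_score(score_a, score_b, se_a, se_b):
    """Two-sample z-statistic for the difference between two accuracy estimates."""
    return (score_a - score_b) / math.sqrt(se_a**2 + se_b**2)

# Hypothetical BBH accuracies 10 points apart, each with SE = 16%.
z = z_score(0.45, 0.35, 0.16, 0.16)
print(round(z, 2))  # well below 1.96, so not significant at p < 0.05
```

With SEs this wide, apparent gaps between models on BBH (including Zephyr's low score) could plausibly be noise.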