euclaise committed
Commit 6c0188b
1 Parent(s): fe8e171

Update README.md

Files changed (1): README.md (+9 -7)
README.md CHANGED
@@ -58,18 +58,20 @@ The format for TinyCoT was:
 
 ## Benchmarks
 
-| Model | Size | Data | Method | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) |
-|:-----------------------------------------------------------------------|--------|:--------------------|---------------|:---------------|:----------------------------------------|
-| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | Base | Base | 2.05% | 25.14% |
+| Model | Size | Data | Method | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BigBench Hard (CoT, few-shot*) |
+|:-----------------------------------------------------------------------|--------|:--------------------|---------------|:---------------|:----------------------------------------|:------------------------------|
+| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | Base | Base | 2.05% | 25.14% |
 | [StableHermes 3B](https://hf.co/cxllin/StableHermes-3b) | 3B | GPT | SFT | 3.64% | 24.31% |
 | [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% |
 | [OpenLLaMA 7B v2 open-instruct](https://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% |
-| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | contaminated (45.72%) | **33.31%** |
-| [**Memphis-CoT 3B**](https://hf.co/euclaise/memphis-cot-3b) | 3B | **Human** | Self-teaching | **13.8%** | *26.24%* |
+| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | contaminated (45.72%) | **33.31%** | 0.91% |
+| [**Memphis-CoT 3B**](https://hf.co/euclaise/memphis-cot-3b) | 3B | **Human** | Self-teaching | **13.8%** | *26.24%* | 38.24% |
+
+\*5-shot, as performed automatically by the LM Evaluation Harness `bbh_cot_fewshot` task even with `num_fewshot=0`
 
-Memphis outperforms human-data models that are over twice its size, along with SFT models of its size, but doesn't quite reach the performance of the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
-
+Memphis outperforms human-data models that are over twice its size, along with SFT models of its size, and trades blows with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
 
+It is unclear why Memphis outperforms Zephyr by such a large margin on BBH, so take that result with a grain of salt. That said, the trend was consistent across all of the BBH subsets, not just specific ones.
+In theory this could be data contamination, but I used only the train splits of the datasets, and I would expect contamination to boost specific subsets rather than improving scores across nearly all of them. It could also be that the contrastive training procedure Memphis uses is necessary to get good CoT performance from small models.
 
 Notes:
 - Evaluations were performed using the `agieval` branch of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (commit `0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298`), using the `vllm` model.
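The note above pins the harness branch and commit but not the exact invocation. A sketch of what the setup likely looked like, assuming the refactored `lm_eval` CLI is available on that branch (the task names, extras, and flags shown here are assumptions, not copied from the commit):

```shell
# Invocation sketch only — needs a GPU and network access; exact task names
# may differ on the pinned `agieval` branch/commit.
git clone -b agieval https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git checkout 0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298
pip install -e .
pip install vllm

# GSM8K, 5-shot, via the vLLM backend (the `vllm` model type from the README note)
lm_eval --model vllm \
    --model_args pretrained=euclaise/memphis-cot-3b \
    --tasks gsm8k \
    --num_fewshot 5

# BBH CoT few-shot: the task's own few-shot prompt is applied by the harness,
# which is why the footnote calls it 5-shot even with --num_fewshot 0
lm_eval --model vllm \
    --model_args pretrained=euclaise/memphis-cot-3b \
    --tasks bbh_cot_fewshot \
    --num_fewshot 0
```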