---
base_model:
- euclaise/Memphis-CoT-3B
license: cc-by-4.0
datasets:
- euclaise/TinyCoT
- euclaise/mathoverflow-accepted
- euclaise/reddit-instruct
- euclaise/WritingPrompts_curated
- sablo/oasst2_curated
- euclaise/mathqa_programs
- BEE-spoke-data/coedit-reworded-deduped
- pszemraj/booksum-short
library_name: transformers
tags:
- supertrainer2000
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/wEaKoLeJzidUdTWwQmA6k.png)

Memphis-scribe 3B alpha is a finetune of [Memphis-CoT 3B](https://huggingface.co/euclaise/Memphis-CoT-3B), itself a finetune of [StableLM 3B 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t/), on more creative data.


It is trained further on TinyCoT, and additionally on:
- 10000 comments from [reddit-instruct](https://hf.co/euclaise/reddit-instruct)
- 15000 comments from [writingprompts-curated](https://hf.co/euclaise/writingprompts-curated)
- 4000 examples of [converting MathQA problems to Python snippets](https://hf.co/euclaise/mathqa_programs)
- 3000 examples of [shorter booksum cases (both chapter->summary and summary->chapter tasks)](https://huggingface.co/datasets/pszemraj/booksum-short)
- 10000 examples from [mathoverflow-accepted](https://hf.co/euclaise/mathoverflow-accepted) comments with >10 upvotes
- 1000 examples from [coedit-reworded-deduped](https://huggingface.co/datasets/BEE-spoke-data/coedit-reworded-deduped)

## Training procedure

I started from [Memphis-CoT 3B](https://huggingface.co/euclaise/Memphis-CoT-3B), which used a novel iterative contrastive finetuning procedure to improve reasoning ability.

I directly finetuned it on these examples, using a MixCE loss with a mixing ratio of 0.5.
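For reference, below is a minimal PyTorch sketch of a MixCE-style objective. It assumes the common one-sided approximation in which the reverse cross-entropy term is replaced by the token negative log-likelihood weighted by the model's own detached token probability, mixed with the standard forward cross-entropy via the ratio `eta`. The function name and masking details are illustrative, not the exact training code used here.

```python
import torch
import torch.nn.functional as F

def mixce_loss(logits, labels, eta=0.5, ignore_index=-100):
    """MixCE-style loss: eta * forward CE + (1 - eta) * approximate reverse CE,
    where the reverse term weights each token's NLL by the model's own
    (detached) probability of that token."""
    # Shift so each position predicts the next token, as in causal LM training
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()

    log_probs = F.log_softmax(logits, dim=-1)
    flat_log_probs = log_probs.view(-1, log_probs.size(-1))
    flat_labels = labels.view(-1)

    mask = flat_labels != ignore_index
    safe_labels = flat_labels.clamp(min=0)

    token_log_p = flat_log_probs.gather(1, safe_labels.unsqueeze(1)).squeeze(1)
    token_p = token_log_p.detach().exp()  # stop-gradient on the weight

    # Mix: plain CE weight (eta) plus self-weighted CE weight (1 - eta)
    weights = eta + (1.0 - eta) * token_p
    nll = -(weights * token_log_p)[mask]
    return nll.mean()
```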

Finetuning on top of a finetune in this way tends to cause catastrophic forgetting, and indeed I observed significant degradation of the resulting model on e.g. GSM8K.

A common strategy to mitigate catastrophic forgetting is weight averaging. In the LM community, 'merges' also rely on weight averaging, and spherical linear interpolation (SLERP) is generally considered superior to linear averaging. Accordingly, I used SLERP to average the resulting model back with the original Memphis-CoT model.
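For illustration, here is a minimal sketch of per-tensor SLERP merging in PyTorch. The interpolation factor `t=0.5` and the per-tensor treatment are assumptions for the example, not necessarily the exact merge configuration used for this model.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors of the same shape.
    Falls back to linear interpolation when the tensors are nearly colinear."""
    v0 = w0.flatten().float()
    v1 = w1.flatten().float()
    cos_omega = torch.sum(v0 * v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:  # nearly parallel: plain lerp is numerically safer
        merged = (1.0 - t) * v0 + t * v1
    else:
        so = torch.sin(omega)
        merged = (torch.sin((1.0 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return merged.view_as(w0).to(w0.dtype)

# Hypothetical usage: merge each parameter of the finetuned model
# back toward the original Memphis-CoT checkpoint.
# merged_state = {
#     name: slerp(original_state[name], finetuned_state[name], t=0.5)
#     for name in finetuned_state
# }
```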

This resulted in a model that has learned from the new data without completely forgetting what it had learned from the original Memphis-CoT training.

## Benchmarks

This model performs significantly worse than Memphis-CoT on benchmarks, despite being better suited to chat and creative writing tasks. This is an expected tradeoff.


| Model                                                                      | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
|:---------------------------------------------------------------------------|:---------------|:----------------------------------------|:------------------------------  |
| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-3b-4e1t)             | 2.05%          | 25.14%                                  | 36.75%                          |
| [Memphis-CoT 3B](https://hf.co/euclaise/Memphis-CoT-3B)                    | 13.8%          | 26.24%                                  | 38.24%                          |
| [Memphis-scribe 3B alpha](https://hf.co/euclaise/Memphis-scribe-3B-alpha)  | 12.28%         | 23.92%                                  | 38.1%                           |
\*5-shot prompts, as applied automatically by the LM Evaluation Harness `bbh_cot_fewshot` task, even with `num_fewshot=0`