---
license: cc-by-sa-3.0
datasets:
- euclaise/TinyCoT
- euclaise/reddit-instruct
- sablo/oasst2_curated
library_name: transformers
tags:
- supertrainer2000
---

Memphis-CoT is a finetune of [StableLM 3b 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).

**Memphis was trained *only* on human data! No GPT generations here.**

Finetuning was performed using my [supertrainer2000](https://github.com/euclaise/supertrainer2000) framework, with my Adalite optimizer.
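
Since the card declares `library_name: transformers`, the model should load with standard Transformers calls. The snippet below is only a minimal sketch: the repository id and prompt are placeholders I've assumed for illustration, not values taken from this card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "euclaise/Memphis-CoT-3B"  # placeholder: substitute this model's actual repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)  # older transformers versions may need trust_remote_code=True

# Illustrative prompt only; use whatever prompt format the training data follows.
inputs = tokenizer("Q: What is 7 * 6?\nA:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```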

### Training Procedure

I finetuned the model using an iterative rationale-bootstrapping procedure inspired by [STaR](https://research.google/pubs/star-self-taught-reasoner-bootstrapping-reasoning-with-reasoning/) and [SPIN](https://arxiv.org/abs/2401.01335).

First, I finetuned the model on all the datasets using a [MixCE](https://arxiv.org/abs/2305.16958) loss and [NEFTune](https://arxiv.org/abs/2310.05914) for 2 epochs.
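
For intuition, here is a rough sketch of the two techniques named above, based on my reading of the cited papers. The per-token approximation of the reverse-CE term and the detached `q`-weight are assumptions; this is not the supertrainer2000 implementation.

```python
import torch
import torch.nn.functional as F

def mixce_loss(logits, labels, ratio=0.75, ignore_index=-100):
    """MixCE sketch: mix forward CE with a self-reinforced (reverse-CE-style)
    term that weights each gold token by the model's own probability of it."""
    logits, labels = logits[:, :-1, :], labels[:, 1:]   # predict token t from tokens < t
    mask = labels != ignore_index
    logprobs = F.log_softmax(logits, dim=-1)
    tok_logprob = logprobs.gather(-1, labels.masked_fill(~mask, 0).unsqueeze(-1)).squeeze(-1)
    weight = ratio + (1.0 - ratio) * tok_logprob.detach().exp()  # eta + (1 - eta) * q(x_t)
    return -(weight * tok_logprob)[mask].mean()

def neftune_noise(embeddings, alpha=10.0, training=True):
    """NEFTune sketch: add uniform noise to input embeddings during training,
    scaled by alpha / sqrt(seq_len * hidden_dim)."""
    if not training:
        return embeddings
    seq_len, hidden = embeddings.shape[-2], embeddings.shape[-1]
    scale = alpha / (seq_len * hidden) ** 0.5
    return embeddings + torch.empty_like(embeddings).uniform_(-scale, scale)
```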

I then performed the following steps 3 times:
1. Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra responses are discarded so that each correct and each incorrect response is unique.
2. Finetune the model for 1 epoch using a ranking loss over the length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs. incorrect generated responses; a sketch of this loss follows below. A standard CE loss over the ground truth was included to prevent excessive drift.
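
To make step 2 concrete, here is one way such a pairwise objective could be written. The function names, the `softplus` form of the two-way ranking term, and the way the rank-loss weight combines with the CE anchor are my assumptions for illustration, not supertrainer2000's actual code.

```python
import torch.nn.functional as F

def length_normalized_logprob(model, input_ids, labels, ignore_index=-100):
    """Mean per-token log-probability of a response under the current model."""
    logits = model(input_ids=input_ids).logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = labels != ignore_index
    logprobs = F.log_softmax(logits, dim=-1)
    tok = logprobs.gather(-1, labels.masked_fill(~mask, 0).unsqueeze(-1)).squeeze(-1)
    return (tok * mask).sum(-1) / mask.sum(-1)

def rank_step_loss(model, correct, incorrect, ground_truth, rank_weight=5.0):
    """Pairwise ranking loss over length-normalized log-probs, plus a standard
    CE loss on the ground-truth response to prevent drift."""
    s_pos = length_normalized_logprob(model, correct["input_ids"], correct["labels"])
    s_neg = length_normalized_logprob(model, incorrect["input_ids"], incorrect["labels"])
    rank_loss = F.softplus(s_neg - s_pos).mean()  # -log softmax over the (correct, incorrect) pair
    ce_loss = model(input_ids=ground_truth["input_ids"], labels=ground_truth["labels"]).loss
    return rank_weight * rank_loss + ce_loss
```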

This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR), and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN).

### Hyperparameters

For the initial supervised finetuning step:
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda (Adalite's analogue to weight decay) of 0.01
- LR of 1e-5
- MixCE ratio of 0.75
- Sequence length of 4096
- Cosine decay with a 20% warmup
- Frozen embeddings
- No training on inputs
- Accumulated batch size of 128
- NEFTune with an alpha of 10

For the generations:
- Generated using the current git version of `vllm`
- N=8
- Temperature of 0.5
- `top_p` of 0.8
- Maximum of 512 generated tokens, discarding responses that do not have a valid rationale and answer (see the sampling sketch after this list)
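
Those settings roughly translate into a vLLM call like the one below; the checkpoint path and prompt construction are placeholders, not values from this card.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/current-checkpoint")  # placeholder: the current iteration's checkpoint
params = SamplingParams(n=8, temperature=0.5, top_p=0.8, max_tokens=512)

prompts = ["..."]  # placeholder: TinyCoT questions rendered in the training prompt format
outputs = llm.generate(prompts, params)
# Responses without a valid rationale and answer are then discarded.
```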

For the rank finetuning:
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda of 0.01
- LR of 5e-7
- Rank loss weight of 5
- Sequence length of 1024
- Cosine schedule with 10% warmup
- Frozen embeddings
- No training on inputs
- Accumulated batch size of 128
- NEFTune with an alpha of 10