euclaise/Memphis-CoT-3B

Now with a training bug fixed!

Memphis-CoT is a finetune of StableLM 3b 4e1t on TinyCoT and SciCoT, along with reddit-instruct (subset to 5000 examples, excluding posts with brackets in the title) and a curated subset of oasst2.

Memphis was trained only on human data! No GPT generations here.

Finetuning was performed using my supertrainer2000 framework, using my Adalite optimizer.

Training Procedure

I finetuned the model using an iterative rationale-bootstrapping procedure inspired by STaR and SPIN.

First, I finetuned the model on all the datasets using a MixCE loss and NEFTune, for 2 epochs.

I then performed the following steps 3 times:

Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra responses are discarded, so that each correct and each incorrect response is used only once.
Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to Preference Ranking Optimization, comparing the correct vs incorrect generated response. Additionally, a standard CE loss over the chosen completion was included.
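As a rough illustration of the objective in the second step, here is a minimal sketch in plain Python. It assumes the two-candidate case of a softmax ranking loss (which reduces to a softplus of the score difference), a token-averaged CE term, and the rank loss weight of 0.25 from the hyperparameters. The function names are hypothetical, and the actual supertrainer2000 implementation may differ in its details.

```python
import math

def length_normalized_logprob(token_logprobs):
    """Mean per-token log-probability of one completion."""
    return sum(token_logprobs) / len(token_logprobs)

def pairwise_rank_loss(chosen_logprobs, rejected_logprobs):
    """Two-candidate softmax ranking loss: -log(e^c / (e^c + e^r))."""
    c = length_normalized_logprob(chosen_logprobs)
    r = length_normalized_logprob(rejected_logprobs)
    diff = r - c
    # Numerically stable softplus(diff) = log(1 + exp(diff)).
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def combined_loss(chosen_logprobs, rejected_logprobs, rank_weight=0.25):
    """CE on the chosen completion plus a weighted ranking term (assumed form)."""
    ce = -length_normalized_logprob(chosen_logprobs)
    rank = pairwise_rank_loss(chosen_logprobs, rejected_logprobs)
    return ce + rank_weight * rank
```

When both completions score equally, the ranking term equals log 2; it shrinks toward zero as the correct completion becomes more likely than the incorrect one.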

This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR), and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN).

To prevent excessive drift, I kept the model weights as a moving average: After each generate+train cycle, I interpolated between the previous model weights and the updated weights using spherical linear interpolation (SLERP), with an interpolation factor of 0.99.
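For reference, a minimal SLERP sketch over flattened weight vectors, in plain Python. It assumes the angle is measured between the normalized vectors and falls back to linear interpolation when the vectors are nearly parallel, as is common in model-merging code. Which endpoint the factor of 0.99 weights is not stated above; in this sketch `t` is simply the weight on the second argument.

```python
import math

def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 == 0.0 or norm1 == 0.0:
        # No defined angle: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel (or antiparallel): linear interpolation is stable.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

In practice this would be applied per parameter tensor (flattened) rather than to the whole model at once, though that choice is also an assumption here.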

Prompt formats

The format for reddit-instruct and oasst2 was:

### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...

The format for TinyCoT was:

### User:
[insert instruction here]
### Rationale:
[insert reasoning here]
### Answer:
[insert direct answer here]
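A small sketch of how these two formats could be built and parsed in Python. The helper names are hypothetical, and the exact whitespace (single newlines, no trailing newline) is an assumption based on the templates above.

```python
def format_chat(turns):
    """Render (instruction, response) pairs in the reddit-instruct/oasst2 format."""
    parts = []
    for instruction, response in turns:
        parts.append(f"### User:\n{instruction}\n### Assistant:\n{response}")
    return "\n".join(parts)

def format_cot_prompt(instruction):
    """Render a TinyCoT prompt, leaving the rationale for the model to write."""
    return f"### User:\n{instruction}\n### Rationale:\n"

def parse_cot(completion):
    """Split text generated after '### Rationale:' into (rationale, answer).

    Returns None when either part is missing or empty.
    """
    marker = "### Answer:"
    if marker not in completion:
        return None
    rationale, answer = completion.split(marker, 1)
    rationale, answer = rationale.strip(), answer.strip()
    if not rationale or not answer:
        return None
    return rationale, answer
```

For example, `parse_cot("2 + 2 is 4.\n### Answer:\n4")` yields `("2 + 2 is 4.", "4")`, while a completion without an answer section yields `None`.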

Benchmarks

Model	Size	Data	Method	GSM8K (5-shot)	AGIEval (English/Nous subset, acc_norm)	BIG Bench Hard (CoT, few-shot*)
StableLM 3B Base	3B	Base	Base	2.05%	25.14%	36.75%
StableHermes 3B	3B	GPT	SFT	3.64%	24.31%	37.28%
MPT 7B Instruct	7B	Human+Anthropic	SFT	2.05%	24.12%	11.01%
OpenLLaMA 7B v2 open-instruct	7B	Human (nearly: ecqa is an exception)	SFT	8.64%	23.21%	29.84%
StableLM Zephyr 3B	3B	GPT	DPO	possibly contaminated (45.72%)	33.31%	0.91%
LIMA LLaMA 2 7B	7B	Human	SFT	4.55%	24.55%	36.29%
Memphis-CoT 3B	3B	Human	Self-teaching	18.8%	27.22%	36.92%

*5-shot, as performed automatically by LM Evaluation Harness bbh_cot_fewshot even with num_fewshot=0

Memphis outperforms other primarily-human-data models that are over twice its size, along with SFT models of its size, and trades blows with the Zephyr DPO model. That said, Zephyr uses synthetic data, and much more of it.

Note that BBH results have wide SEs, sometimes even exceeding 16%.

It is unclear why Zephyr performs so poorly on BBH. Perhaps it is overfit, or maybe there was an issue with vllm.

Notes:

Evaluations were performed using the agieval branch of lm-evaluation-harness (commit 0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298), using the vllm model.
I tried to find human-data-trained StableLM models, but couldn't find any. I did find a few OpenLLaMA models, but they wouldn't load with LM Eval Harness and vllm. (I believe this can be fixed by changing the xformers backend, but I'm too lazy for that)
OpenLLaMA 7B v2 open-instruct is a particularly relevant comparison, as it was trained on a very similar dataset.

Hyperparameters

For the initial supervised finetuning step:

Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
Lambda (Adalite's analogue to weight decay) of 0.01
LR of 1e-5
MixCE ratio of 0.75
Sequence length of 4096
Cosine decay with a 20% warmup
Frozen embeddings
No training on inputs
Accumulated batch size of 128
NEFTune with an alpha of 10
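To make the schedule concrete, here is a minimal sketch of a cosine decay with a 20% warmup at a peak LR of 1e-5. It assumes linear warmup and decay to zero, neither of which is stated above; the function name is hypothetical and supertrainer2000's scheduler may behave differently.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.2):
    """Learning rate at a given optimizer step (0-indexed)."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup (assumed), reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

The same function covers the rank finetuning stage by passing `peak_lr=5e-7` and `warmup_frac=0.1`.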

For the generations:

Generated using the current git version of vllm
N=8
Temperature of 0.5
top_p of 0.8
Maximum of 512 generated tokens, discarding responses that do not have a valid rationale and answer
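The filtering and pairing of these generations could look roughly like the sketch below. It assumes each response has already been parsed into a (rationale, answer) tuple, with None for malformed ones, and that correctness is checked by exact string match against a reference answer. Both the helper name and the matching rule are assumptions.

```python
def build_pairs(responses, gold_answer):
    """Pair unique correct responses with unique incorrect ones.

    responses: list of (rationale, answer) tuples, or None for generations
    that lacked a valid rationale and answer.
    """
    correct, incorrect = [], []
    seen = set()
    for parsed in responses:
        if parsed is None:
            continue  # discard malformed generations
        if parsed in seen:
            continue  # keep each distinct response once
        seen.add(parsed)
        _, answer = parsed
        if answer.strip() == gold_answer.strip():
            correct.append(parsed)
        else:
            incorrect.append(parsed)
    # zip stops at the shorter list, so extras on either side are discarded
    # and every response appears in at most one pair.
    return list(zip(correct, incorrect))
```

A question whose N=8 samples are all correct or all incorrect contributes no pairs under this scheme.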

For the rank finetuning:

Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
Lambda of 0.01
LR of 5e-7
Rank loss weight of 0.25
Sequence length of 1024
Cosine schedule with 10% warmup
Frozen embeddings
No training on inputs
Accumulated batch size of 128
NEFTune with an alpha of 10
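For readers unfamiliar with NEFTune, here is a minimal sketch of the noise it adds to the input embeddings during training, following the scaling described in the NEFTune paper: uniform noise in [-1, 1] multiplied by alpha / sqrt(L * d). It uses plain Python lists for illustration, and how supertrainer2000 applies it is not shown here.

```python
import math
import random

def neftune_noise(embeddings, alpha=10.0, rng=random):
    """Return a noised copy of a sequence of token embeddings.

    embeddings: list of L token vectors, each a list of d floats.
    """
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    # Independent uniform noise per element; the input is left unmodified.
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]
```

The noise is applied only at training time; inference uses the unmodified embeddings.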
