|
--- |
|
license: cc-by-sa-3.0 |
|
datasets: |
|
- euclaise/TinyCoT |
|
- euclaise/reddit-instruct |
|
- sablo/oasst2_curated |
|
library_name: transformers |
|
tags: |
|
- supertrainer2000 |
|
--- |
|
|
|
|
|
Memphis-CoT is a finetune of [StableLM 3b 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).
|
|
|
**Memphis was trained *only* on human data! No GPT generations here.** |
|
|
|
Finetuning was performed with my [supertrainer2000](https://github.com/euclaise/supertrainer2000) framework, using my Adalite optimizer.
|
|
|
|
|
### Training Procedure |
|
I finetuned the model using an iterative rationale-bootstrapping procedure inspired by [STaR](https://research.google/pubs/star-self-taught-reasoner-bootstrapping-reasoning-with-reasoning/) and [SPIN](https://arxiv.org/abs/2401.01335).
|
|
|
First, I finetuned the model on all the datasets for 2 epochs, using a [MixCE](https://arxiv.org/abs/2305.16958) loss and [NEFTune](https://arxiv.org/abs/2310.05914).
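
For reference, MixCE mixes the standard forward cross-entropy with an approximation of the reverse cross-entropy, controlled by a mixing ratio (0.75 here, per the hyperparameters below). Below is a minimal token-level sketch, assuming the self-reinforced approximation from the MixCE paper; it is not supertrainer2000's actual implementation.

```python
import torch
import torch.nn.functional as F

def mixce_loss(logits, labels, ratio=0.75, ignore_index=-100):
    """Token-level MixCE sketch: forward CE mixed with the self-reinforced
    approximation of reverse CE (assumed form), weighted by `ratio`."""
    logprobs = F.log_softmax(logits, dim=-1)        # (batch, seq, vocab)
    mask = labels.ne(ignore_index)
    token_logp = logprobs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    token_p = token_logp.exp().detach()             # stop-grad model probability of the gold token
    weight = ratio + (1.0 - ratio) * token_p        # mix forward and (approximate) reverse terms
    return -(weight * token_logp)[mask].mean()
```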
|
|
|
I then performed the following steps 3 times: |
|
1. Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra responses are discarded, so that each correct and each incorrect response appears only once.
|
2. Finetune the model for 1 epoch using a ranking loss over the length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct and incorrect generated responses (a rough sketch follows this list). A standard CE loss over the ground-truth responses was included to prevent excessive drift.
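
Concretely, the rank step scores each generated response by its length-normalized sequence log-probability and applies a PRO-style pairwise ranking loss between the correct and incorrect response, anchored by a CE term on the ground-truth TinyCoT response. A minimal sketch follows; the exact weighting and reduction are my assumptions, not supertrainer2000's code.

```python
import torch
import torch.nn.functional as F

def lognorm_logprob(logits, labels, ignore_index=-100):
    """Length-normalized log-probability of `labels` under `logits`."""
    logprobs = F.log_softmax(logits, dim=-1)        # (batch, seq, vocab)
    mask = labels.ne(ignore_index)
    token_logp = logprobs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)

def rank_step_loss(correct_score, incorrect_score, ce_loss, rank_weight=5.0):
    """PRO-style ranking loss (with two candidates it reduces to a log-sigmoid
    of the score difference), plus a CE anchor on the ground-truth response."""
    rank_loss = -F.logsigmoid(correct_score - incorrect_score).mean()
    return rank_weight * rank_loss + ce_loss
```

The rank loss weight of 5 and the length normalization correspond to the rank-finetuning hyperparameters listed below.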
|
|
|
This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR), and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN). |
|
|
|
### Hyperparameters |
|
|
|
For the initial supervised finetuning step: |
|
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified |
|
- Lambda (Adalite's analogue to weight decay) of 0.01 |
|
- LR of 1e-5 |
|
- MixCE ratio of 0.75 |
|
- Sequence length of 4096 |
|
- Cosine decay with a 20% warmup |
|
- Frozen embeddings |
|
- No training on inputs |
|
- Accumulated batch size of 128 |
|
- NEFTune with an alpha of 10 (sketched after this list)
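
For reference, NEFTune simply adds uniform noise to the input embeddings during training, scaled by alpha over the square root of sequence length times hidden size, following the NEFTune paper. A minimal sketch with alpha = 10:

```python
import torch

def neftune_embeddings(embeds: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    """Add NEFTune noise to input embeddings (training only).

    embeds: (batch, seq_len, hidden_dim)
    """
    seq_len, hidden_dim = embeds.shape[1], embeds.shape[2]
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeds).uniform_(-1.0, 1.0) * scale
    return embeds + noise
```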
|
|
|
For the generations: |
|
- Generated using the current git version of `vllm` (see the sketch after this list)
|
- N=8 |
|
- Temperature of 0.5 |
|
- `top_p` of 0.8 |
|
- Maximum of 512 generated tokens, discarding responses that do not have a valid rationale and answer |
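
These settings map directly onto `vllm` sampling parameters; roughly as follows, where the checkpoint path and prompt are placeholders:

```python
from vllm import LLM, SamplingParams

# Sampling settings from the list above; checkpoint path and prompt are placeholders.
params = SamplingParams(n=8, temperature=0.5, top_p=0.8, max_tokens=512)
llm = LLM(model="path/to/current-checkpoint")

outputs = llm.generate(["<TinyCoT question formatted as a prompt>"], params)
for request in outputs:
    for candidate in request.outputs:
        print(candidate.text)  # keep only responses with a valid rationale and answer
```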
|
|
|
For the rank finetuning: |
|
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified |
|
- Lambda of 0.01 |
|
- LR of 5e-7 |
|
- Rank loss weight of 5 |
|
- Sequence length of 1024 |
|
- Cosine schedule with 10% warmup |
|
- Frozen embeddings |
|
- No training on inputs |
|
- Accumulated batch size of 128 |
|
- NEFTune with an alpha of 10 |