|
--- |
|
license: llama2 |
|
language: |
|
- en |
|
datasets: |
|
- OpenAssistant/oasst1 |
|
- ehartford/dolphin |
|
- rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored |
|
- argilla/databricks-dolly-15k-curated-multilingual |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
tags: |
|
- sft |
|
--- |
|
# Open-Assistant Llama2 70B SFT v10 |
|
|
|
This model is an Open-Assistant fine-tuning of Meta's [Llama2 70B](https://huggingface.co/meta-llama/Llama-2-70b) LLM. |
|
The model was fine-tuned in two stages: first on a mix of synthetic instruction and coding-task data, and then in a second "finishing" stage

on top-1 human Open-Assistant demonstrations exported on July 23, 2023 (see the Configuration Details section below).
|
|
|
## Model Details |
|
|
|
- **Finetuned from:** [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) via [epfLLM/old-Megatron-LM](https://github.com/epfLLM/old-Megatron-LM) |
|
- **Model type:** Causal decoder-only transformer language model |
|
- **Language:** English (and limited capabilities in German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish) |
|
- **Weights & Biases training logs:** [Stage 1](https://wandb.ai/open-assistant/public-sft/runs/run45_oasst_pre10_llama2_70b) (1 epoch pretrain-mix, 12k steps), [Stage 2](https://wandb.ai/open-assistant/public-sft/runs/run46_oasst_sft10_llama2_70b) (3 epochs oasst top-1, 519 steps) |
|
- **Demo:** [Continuations for 250 random prompts (TGI, 4bit nf4 quantization)](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-08-22_OpenAssistant_llama2-70b-oasst-sft-v10_sampling_noprefix2_nf4.json%0A) |
|
- **Evaluation:** [FastEval-OpenAssistant Overview](https://tju01.github.io/FastEval-OpenAssistant/) (using [FastEval](https://github.com/FastEval/FastEval) & [vLLM](https://github.com/vllm-project/vllm))
|
- **License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt) |
|
- **Contact:** [Open-Assistant Discord](https://ykilcher.com/open-assistant-discord) |
|
|
|
|
|
## Prompting / Prompt Template |
|
|
|
The model was trained with OpenAI's [chatml](https://github.com/openai/openai-python/blob/main/chatml.md) prompt format: |
|
"<|im_start|>system\n{system_message}<im_end>\n<|im_start|>user\n{user prompt}<|im_end|>\n<|im_start|>assistant\n{Assistant answer}<|im_end|>\n" |
|
|
|
|
|
Multi-line: |
|
|
|
``` |
|
<|im_start|>system |
|
{system_message}<|im_end|> |
|
<|im_start|>user |
|
{user prompt}<|im_end|> |
|
<|im_start|>assistant |
|
{Assistant answer}<|im_end|> |
|
``` |
|
|
|
The model was partly trained with orca system messages. For inference, we recommend using the official [llama2 system prompt](https://github.com/facebookresearch/llama/blob/ea9f33d6d3ea8ed7d560d270986407fd6c2e52b7/example_chat_completion.py#L57-L61):
|
``` |
|
<|im_start|>system |
|
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. |
|
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. |
|
<|im_end|> |
|
``` |
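
The snippet below is a minimal sketch of how such a chatml prompt can be assembled and passed to the model with 🤗 Transformers (the repository id, generation parameters, and example messages are illustrative assumptions, not part of the training setup):

```
# Minimal inference sketch; for a 70B model you will typically want 4-/8-bit loading
# or multi-GPU sharding via accelerate, omitted here for brevity.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/llama2-70b-oasst-sft-v10"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

system_message = "You are a helpful, respectful and honest assistant."  # shortened for the example
user_prompt = "Write a haiku about open source."

# Assemble the chatml-style prompt exactly as shown above.
prompt = (
    f"<|im_start|>system\n{system_message}<|im_end|>\n"
    f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9, temperature=0.8)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```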
|
|
|
### Credits & Special Thanks |
|
|
|
- Compute was generously sponsored by the EPFL [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/).
|
- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning. |
|
- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset. |
|
- [ehartford](https://huggingface.co/ehartford) generated and published the [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) and the [ehartford/oa_leet10k](https://huggingface.co/datasets/ehartford/oa_leet10k) datasets. |
|
- [Argilla](https://huggingface.co/argilla) curated and published the [argilla/databricks-dolly-15k-curated-multilingual](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) dataset.
|
- [shahules786](https://github.com/shahules786) de-duped and filtered the Dolphin dataset with a cluster-center approach and generated the orca-best (orca-chat) dataset.
|
- [andreaskoepf](https://github.com/andreaskoepf/) prepared & orchestrated the training. |
|
|
|
We want to especially thank everyone who contributed to the crowd-sourced Open-Assistant dataset creation at https://open-assistant.io/ - without you this project would not have been possible.
|
|
|
## Ethical Considerations and Limitations |
|
|
|
Testing conducted to date has been in English and has not covered, nor could it cover, all scenarios.
|
For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted |
|
in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses |
|
to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should |
|
perform safety testing and tuning tailored to their specific applications of the model. |
|
|
|
Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/). |
|
|
|
|
|
## Configuration Details |
|
|
|
The "pretokenizer" utility used to tokenize the datamix is part of the Open-Assistant github repository and can be found here: [model/pretokenizer](https://github.com/LAION-AI/Open-Assistant/tree/main/model/pretokenizer). |
|
|
|
|
|
### Stage 1 Pretokenizer Configuration |
|
|
|
Entries of the dataset with assistant replies shorter than 25 tokens were excluded from training. |
|
|
|
``` |
|
oasst_pre10_min25: |
|
datasets: |
|
- megacode2: |
|
fraction: 0.5 |
|
val_split: 0.01 |
|
max_val_set: 1000 |
|
- orca-chat: |
|
val_split: 0.01 |
|
max_val_set: 1000 |
|
- dolly15k_multilingual: |
|
val_split: 0.05 |
|
max_val_set: 300 |
|
- oa_leet10k: |
|
val_split: 0.05 |
|
max_val_set: 250 |
|
output_dir: "output/oasst_pre10_min25" |
|
filename_prefix: "oasst_pre10" |
|
min_assistant_tokens: 25 |
|
``` |
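
The `min_assistant_tokens` exclusion mentioned above corresponds roughly to the filter sketched below (hypothetical sample layout and a whitespace word count as a stand-in for the Llama 2 SentencePiece tokenizer; this is not the actual pretokenizer code):

```
# Illustrative sketch of the min_assistant_tokens rule (hypothetical data layout;
# the real pretokenizer uses the Llama 2 SentencePiece tokenizer, not word counts).
def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for tokenizer.encode(text)

def keep_sample(sample: dict, min_assistant_tokens: int = 25) -> bool:
    """Keep a sample only if every assistant reply reaches the minimum token count."""
    replies = [turn["text"] for turn in sample["turns"] if turn["role"] == "assistant"]
    return all(count_tokens(r) >= min_assistant_tokens for r in replies)

samples = [
    {"turns": [{"role": "user", "text": "Hi"},
               {"role": "assistant", "text": "Hello!"}]},                              # too short, excluded
    {"turns": [{"role": "user", "text": "Explain RMSNorm"},
               {"role": "assistant", "text": "RMSNorm rescales activations " * 10}]},  # long enough, kept
]
kept = [s for s in samples if keep_sample(s)]
print(len(kept))  # 1
```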
|
|
|
Stage 1 dataset statistics: |
|
``` |
|
# Stats for output/oasst_pre10_min25_llama2 |
|
|
|
## Stats for 'Subset of InstructionDataset (megacode2)' (466364 samples (50.0%)) |
|
----------------- |
|
Accepted: 398223/466364 (85.4%) |
|
Accepted tokens: 167676873 |
|
Skipped: 68141 (14.6%) |
|
Min tokens per sample: 36 |
|
Max tokens per sample: 11810 |
|
Avg tokens per sample: 421.063 |
|
----------------- |
|
|
|
## Stats for 'Subset of OrcaChat (orca-chat)' (325616 samples (100.0%)) |
|
----------------- |
|
Accepted: 325616/325616 (100.0%) |
|
Accepted tokens: 178307574 |
|
Skipped: 0 (0.0%) |
|
Min tokens per sample: 105 |
|
Max tokens per sample: 10408 |
|
Avg tokens per sample: 547.601 |
|
----------------- |
|
|
|
## Stats for 'Subset of Dolly15kMultilingual' (57020 samples (100.0%)) |
|
----------------- |
|
Accepted: 47494/57020 (83.3%) |
|
Accepted tokens: 13883177 |
|
Skipped: 9526 (16.7%) |
|
Min tokens per sample: 34 |
|
Max tokens per sample: 9172 |
|
Avg tokens per sample: 292.314 |
|
----------------- |
|
|
|
## Stats for 'Subset of InstructionDataset (oa_leet10k)' (22236 samples (100.0%)) |
|
----------------- |
|
Accepted: 22236/22236 (100.0%) |
|
Accepted tokens: 15905296 |
|
Skipped: 0 (0.0%) |
|
Min tokens per sample: 168 |
|
Max tokens per sample: 10588 |
|
Avg tokens per sample: 715.295 |
|
----------------- |
|
|
|
## Stats for 'total' (871236 samples (100.0%)) |
|
----------------- |
|
Accepted: 793569/871236 (91.1%) |
|
Accepted tokens: 375772920 |
|
Skipped: 77667 (8.9%) |
|
Min tokens per sample: 34 |
|
Max tokens per sample: 11810 |
|
Avg tokens per sample: 473.523 |
|
----------------- |
|
``` |
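
As a quick sanity check, the per-subset numbers above add up to the reported totals (values in the snippet are copied from the report):

```
# Consistency check of the Stage 1 statistics above.
subsets = {                        # (accepted samples, accepted tokens, skipped)
    "megacode2":             (398_223, 167_676_873, 68_141),
    "orca-chat":             (325_616, 178_307_574, 0),
    "dolly15k_multilingual": (47_494,  13_883_177,  9_526),
    "oa_leet10k":            (22_236,  15_905_296,  0),
}

accepted = sum(s for s, _, _ in subsets.values())   # 793,569
tokens   = sum(t for _, t, _ in subsets.values())   # 375,772,920
skipped  = sum(k for _, _, k in subsets.values())   # 77,667

assert accepted == 793_569 and tokens == 375_772_920 and skipped == 77_667
print(f"avg tokens per sample: {tokens / accepted:.3f}")  # 473.523, matching the 'total' block
```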
|
|
|
|
|
### Stage 2 Pretokenizer Configuration |
|
|
|
``` |
|
oasst_top1: |
|
datasets: |
|
- oasst_export: |
|
lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk" |
|
input_file_path: 2023-07-23_oasst_ready.tar.gz |
|
top_k: 1 |
|
val_split: 0.05 |
|
output_dir: "output/oasst_top1_2023-07-23" |
|
filename_prefix: "oasst_top1" |
|
``` |
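
The `top_k: 1` option keeps only the best human-ranked assistant reply at each branch of an exported conversation tree. The sketch below illustrates the idea on a hypothetical node layout (the actual export format and loader live in the Open-Assistant repository):

```
# Illustrative sketch of top_k=1 selection (hypothetical node layout, rank 0 = best).
def keep_top_ranked(replies: list[dict], k: int = 1) -> list[dict]:
    """Keep only the k best human-ranked assistant replies under a given parent."""
    return sorted(replies, key=lambda r: r["rank"])[:k]

replies = [
    {"text": "Answer A", "rank": 1},
    {"text": "Answer B", "rank": 0},
]
print(keep_top_ranked(replies))  # [{'text': 'Answer B', 'rank': 0}]
```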
|
|
|
Stage 2 dataset statistics: |
|
|
|
``` |
|
# Stats for output/oasst_top1_2023-07-23_llama2 |
|
|
|
## Stats for 'ListDataset' (11441 samples (100.0%)) |
|
----------------- |
|
Accepted: 11441/11441 (100.0%) |
|
Accepted tokens: 5315368 |
|
Skipped: 0 (0.0%) |
|
Min tokens per sample: 20 |
|
Max tokens per sample: 5407 |
|
Avg tokens per sample: 464.58945896337735 |
|
----------------- |
|
|
|
## Stats for 'total' (11441 samples (100.0%)) |
|
----------------- |
|
Accepted: 11441/11441 (100.0%) |
|
Accepted tokens: 5315368 |
|
Skipped: 0 (0.0%) |
|
Min tokens per sample: 20 |
|
Max tokens per sample: 5407 |
|
Avg tokens per sample: 464.58945896337735 |
|
----------------- |
|
``` |
|
|
|
|
|
### Megatron Fine-Tuning Arguments for Stage 1 (Instruction Tuning): |
|
``` |
|
--tensor_model_parallel_size 8 |
|
--pipeline_model_parallel_size 4 |
|
--load ./checkpoints/llama2-70b-tp8-pp4 |
|
--save ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10 |
|
--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10/logging |
|
--data_path ./data/oasst_pre10_min25_llama2/oasst_sft10-train |
|
--model_name llama2 |
|
--tokenizer_type SentencePieceTokenizer |
|
--bf16 |
|
--global_batch_size 64 |
|
--micro_batch_size 2 |
|
--vocab_file=./llama2/Llama-2-7b/tokenizer.model |
|
--use_rms_norm |
|
--glu_activation swiglu |
|
--no_tie_embed_logits |
|
--vocab_extra_ids_list "\"<|im_start|>,<|im_end|>\"" |
|
--layernorm_epsilon 1e-5 |
|
--use_flash_attn |
|
--no_bias_gelu_fusion |
|
--seq_length 4096 |
|
--max_position_embeddings 4096 |
|
--log_interval 1 |
|
--save_interval 500 |
|
--eval_interval 50 |
|
--eval_iters 10 |
|
--hidden_dropout 0.0 |
|
--position_embedding_type rotary |
|
--no_bias_dropout_fusion |
|
--use_checkpoint_args |
|
--train_iters 12000 |
|
--attention_dropout 0.0 |
|
--adam_beta1 0.9 |
|
--adam_beta2 0.95 |
|
--adam_eps 1e-12 |
|
--lr_decay_style cosine |
|
--lr_warmup_iters 100 |
|
--lr 1e-5 |
|
--min_lr 1e-6 |
|
--weight_decay 0.000001 |
|
--sequence_parallel |
|
--recompute_granularity selective |
|
--log_timers_to_tensorboard |
|
--rope_scaling_factor 1.0 |
|
--wandb_logger |
|
``` |
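
Note the `--vocab_extra_ids_list` flag: the chatml markers `<|im_start|>` and `<|im_end|>` are added to the Llama 2 SentencePiece vocabulary as extra special tokens. On the Transformers side, the equivalent step looks roughly like the sketch below (illustrative only; the released checkpoint already ships a tokenizer and embedding matrix that include these tokens):

```
# Illustrative sketch: registering the chatml markers as special tokens and resizing
# the embedding matrix accordingly (not needed for the published checkpoint).
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # small base model used here purely for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
model.resize_token_embeddings(len(tokenizer))  # grow embeddings for the new token ids
print(num_added, tokenizer.convert_tokens_to_ids("<|im_end|>"))
```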
|
|
|
### Megatron Fine-Tuning Arguments for Stage 2 (OASST Polishing, LIMA Dropout): |
|
``` |
|
--tensor_model_parallel_size 8 |
|
--pipeline_model_parallel_size 4 |
|
--load ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10 |
|
--save ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10 |
|
--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10/logging |
|
--data_path ./data/oasst_top1_2023-07-23_llama2/oasst_top1-train |
|
--model_name llama2 |
|
--tokenizer_type SentencePieceTokenizer |
|
--bf16 |
|
--global_batch_size 64 |
|
--micro_batch_size 2 |
|
--vocab_file=./llama2/Llama-2-7b/tokenizer.model |
|
--use_rms_norm |
|
--glu_activation swiglu |
|
--no_tie_embed_logits |
|
--vocab_extra_ids_list "\"<|im_start|>,<|im_end|>\"" |
|
--layernorm_epsilon 1e-5 |
|
--use_flash_attn |
|
--no_bias_gelu_fusion |
|
--seq_length 4096 |
|
--max_position_embeddings 4096 |
|
--log_interval 1 |
|
--save_interval 346 |
|
--eval_interval 50 |
|
--eval_iters 10 |
|
--hidden_dropout 0.25 |
|
--lima_dropout |
|
--position_embedding_type rotary |
|
--no_bias_dropout_fusion |
|
--use_checkpoint_args |
|
--train_iters 519 |
|
--attention_dropout 0.0 |
|
--adam_beta1 0.9 |
|
--adam_beta2 0.95 |
|
--adam_eps 1e-12 |
|
--lr_decay_style cosine |
|
--lr_warmup_iters 100 |
|
--lr 1e-5 |
|
--min_lr 1e-6 |
|
--weight_decay 0.000001 |
|
--sequence_parallel |
|
--recompute_granularity selective |
|
--log_timers_to_tensorboard |
|
--rope_scaling_factor 1.0 |
|
--finetune |
|
--wandb_logger |
|
``` |
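
The `--lima_dropout` flag together with `--hidden_dropout 0.25` presumably enables the residual-dropout schedule from the LIMA paper, where the dropout rate increases linearly with layer depth from 0.0 at the bottom layer to the configured maximum at the top layer. A minimal sketch of that schedule, under this assumed interpretation of the flag:

```
# Sketch of a LIMA-style dropout schedule: residual dropout grows linearly with layer
# depth up to hidden_dropout at the top layer (assumed interpretation of --lima_dropout).
def lima_dropout_rates(num_layers: int, max_rate: float = 0.25) -> list[float]:
    if num_layers == 1:
        return [max_rate]
    return [max_rate * layer / (num_layers - 1) for layer in range(num_layers)]

rates = lima_dropout_rates(num_layers=80, max_rate=0.25)  # Llama 2 70B has 80 layers
print(rates[0], rates[40], rates[-1])  # 0.0, ~0.127, 0.25
```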