Aura-MoE-2x4B-v2 / README.md

Update README.md

f15c260 verified 12 days ago

5.12 kB

	---
	license: apache-2.0
	datasets:
	- jeiku/Writing
	- FourOhFour/RP_Phase
	- anthracite-core/full-opus-chosen-hermes-rejected-kto-v1
	language:
	- en
	base_model:
	- IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml
	---
	## Aura-MoE-2x4B-v2

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/626dfb8786671a29c715f8a9/zyGqa-iH77dgU9D8WvoXY.png)

	## Introduction

	Aura-MoE-2x4B-v2 is a state of the art dedicated roleplaying model designed to fulfill your every desire.

	The finetunes used in this merge saw several hundreds of millions of tokens of instruction data. The merge was then healed on 150 million tokens of roleplaying data. A Kahneman-Tversky Optimization was applied to the healed model to give it a unique output style.

	By the numbers, this should be a direct improvement over [Aura-MoE-2x4B](https://huggingface.co/AuraIndustries/Aura-MoE-2x4B)

	Developed by Aura Industries, with contributions from Anthracite Org

	## Model Details

	- Model Name: Aura-MoE-2x4B-v2
	- Base Model: [IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml](https://huggingface.co/IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml)
	- Model Type: Chat Completions
	- Prompt Format: ChatML
	- License: Apache-2.0
	- Language: English
	- Max Context: 8,192+ tokens

	## License

	This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

	## Quantizations

	[Static GGUF](https://huggingface.co/mradermacher/Aura-MoE-2x4B-v2-GGUF)

	[Imatrix GGUF](https://huggingface.co/mradermacher/Aura-MoE-2x4B-v2-i1-GGUF)

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

	Coming soon...

	\| Metric \|Value\|
	\|-------------------\|----:\|
	\|Avg. \| N/A\|
	\|IFEval (0-Shot) \| N/A\|
	\|BBH (3-Shot) \| N/A\|
	\|MATH Lvl 5 (4-Shot)\| N/A\|
	\|GPQA (0-shot) \| N/A\|
	\|MuSR (0-shot) \| N/A\|
	\|MMLU-PRO (5-shot) \| N/A\|

	## Training Configuration

	<details><summary>Click here for Mergekit and Axolotl configs</summary>

	MoE Merge

	```yaml
	base_model: FourOhFour/Zenith_4B
	gate_mode: random
	dtype: bfloat16
	experts_per_token: 1
	experts:
	- source_model: FourOhFour/Luxe_4B
	- source_model: FourOhFour/Zenith_4B
	```

	SFT

	```yaml
	base_model: jeiku/MoEv2
	model_type: AutoModelForCausalLM
	tokenizer_type: AutoTokenizer

	load_in_8bit: false
	load_in_4bit: false
	strict: false

	datasets:
	- path: FourOhFour/RP_Phase
	type: chat_template
	chat_template: chatml
	roles_to_train: ["gpt"]
	field_messages: conversations
	message_field_role: from
	message_field_content: value
	train_on_eos: turn
	- path: jeiku/Writing
	type: completion
	field: text

	chat_template: chatml

	shuffle_merged_datasets: true
	dataset_prepared_path:
	val_set_size: 0.01
	output_dir: ./output/out

	hub_model_id: jeiku/Aura-MoEv2
	hub_strategy: "all_checkpoints"
	push_dataset_to_hub:
	hf_use_auth_token: true

	sequence_len: 8192
	sample_packing: true
	eval_sample_packing: false
	pad_to_sequence_len:

	wandb_project: Aura-MoEv2
	wandb_entity:
	wandb_watch:
	wandb_name: Aura-MoEv2
	wandb_log_model:

	gradient_accumulation_steps: 16
	micro_batch_size: 2
	num_epochs: 2
	optimizer: paged_adamw_8bit
	lr_scheduler: cosine
	learning_rate: 0.00005

	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32: false

	gradient_checkpointing: true
	early_stopping_patience:
	resume_from_checkpoint:
	local_rank:
	logging_steps: 1
	xformers_attention:
	flash_attention: true

	warmup_steps: 10
	evals_per_epoch: 2
	eval_table_size:
	eval_max_new_tokens:
	saves_per_epoch: 1
	debug:
	deepspeed:
	weight_decay: 0.05
	fsdp:
	fsdp_config:
	special_tokens:
	pad_token: <\|finetune_right_pad_id\|>
	```

	KTO

	```yaml
	base_model: jeiku/Aura-MoEv2
	model_type: AutoModelForCausalLM
	tokenizer_type: AutoTokenizer

	load_in_8bit: false
	load_in_4bit: false
	strict: false

	hub_model_id: jeiku/moekto
	hub_strategy: "all_checkpoints"
	push_dataset_to_hub:
	hf_use_auth_token: true

	chat_template: chatml

	rl: kto
	rl_beta: 0.2
	kto_desirable_weight: 0.2

	datasets:
	- path: anthracite-core/full-opus-chosen-hermes-rejected-kto-v1
	type: chatml.argilla

	shuffle_merged_datasets: true
	val_set_size: 0.0
	output_dir: ./outputs/out

	sequence_len: 8192
	sample_packing: false
	eval_sample_packing: false
	pad_to_sequence_len: false

	wandb_project: moekto
	wandb_entity:
	wandb_watch:
	wandb_name: moekto
	wandb_log_model:

	gradient_accumulation_steps: 16
	micro_batch_size: 2
	num_epochs: 2
	max_steps: 500

	optimizer: adamw_8bit
	lr_scheduler: cosine
	learning_rate: 0.00001
	weight_decay: 0.05

	train_on_inputs: false
	group_by_length: false
	bf16: auto
	fp16:
	tf32: true

	gradient_checkpointing: true
	gradient_checkpointing_kwargs:
	use_reentrant: true
	remove_unused_columns: false
	early_stopping_patience:
	resume_from_checkpoint:
	local_rank:
	logging_steps: 1
	xformers_attention:
	flash_attention: true

	warmup_steps: 10
	evals_per_epoch: 2
	eval_table_size:
	eval_max_new_tokens:
	saves_per_epoch: 1

	debug:
	deepspeed:
	fsdp:
	fsdp_config:
	fsdp:
	fsdp_config:

	special_tokens:
	pad_token: <\|finetune_right_pad_id\|>
	```
	</details><br>