---
license: apache-2.0
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
language:
- en
base_model:
- arcee-ai/SuperNova-Medius
library_name: transformers
pipeline_tag: text-generation
---

# Qwen2.5-14B Sugarquill v1

A continued pretrain of SuperNova-Medius on assorted short-story data from the web. SuperNova already had nice prose, but diversifying it a bit definitely doesn't hurt.
It's also nice to finally have a storywriter model with enough context for something longer than a short story. It's a fair bit more temperamental than Gemma, but can be tamed with some sampling.
Instruction following also stayed rather strong, so it works for both RP and storywriting, whether in chat mode via back-and-forth co-writing or via raw completion.
Overall, I'd say it successfully transfers the essence of what I liked about Gemma Sugarquill. I will also make a Qwen version of Aletheia, but with a brand new LoRA, based on a brand new RP dataset that's in the making right now.

The model was trained by Auri.

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), drawn equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on a 5x3090Ti workstation for 7.5 hours with rsLoRA.
I switched back to Axolotl for this run, as LF just plain refused to run at all on this workstation. Also, it's a bf16 LoRA this time. Overall, training went much smoother than last time. I had attempted to train Qwen Sugarquill several times before, but the loss jumped around like crazy. An effective batch size of 40 (1 micro-batch × 8 gradient-accumulation steps × 5 GPUs), rsLoRA, and the paged_ademamix_8bit optimizer seem to have solved the issue completely.
Thanks to Kearm for providing compute for this training run.
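
For reference, here is a minimal sketch of the 50/50 data mix described above, using the `datasets` library. The shared `"text"` column name, the seed, and the per-source row counts are my assumptions for illustration, not the actual preprocessing behind the training set:

```python
from datasets import load_dataset, concatenate_datasets

# Sketch of a 50/50 mix; column name, seed, and counts are illustrative assumptions.
def sample_rows(repo, n):
    ds = load_dataset(repo, split="train").shuffle(seed=42).select(range(n))
    return ds.select_columns(["text"])  # keep one shared column so the sets concatenate cleanly

mix = concatenate_datasets([
    sample_rows("Mielikki/Erebus-87k", 5000),
    sample_rows("allura-org/r_shortstories_24k", 5000),
]).shuffle(seed=42)
print(len(mix))  # 10000
```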

**Format**

The model responds to ChatML instruct formatting, exactly like its base model.

```
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
```
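
Since the base model ships with a ChatML chat template, `tokenizer.apply_chat_template` should produce this format automatically. A minimal `transformers` sketch; the repo id is a placeholder for wherever the merged model lives, and `min_p` requires a reasonably recent `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Qwen2.5-14B-Sugarquill-v1"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write the opening of a gothic short story."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=400,
    do_sample=True,
    temperature=0.8,
    min_p=0.05,
    repetition_penalty=1.03,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```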

**Recommended Samplers**

I found this configuration to be quite stable:

```
Temperature - 0.8
Min-P - 0.05
Top-A - 0.3
Repetition Penalty - 1.03
```

Feel free to experiment with samplers once you get a feel for the model. It seems to like Top-A and Smooth Sampling quite a bit.
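
Top-A isn't exposed by every inference stack. For backends that do support it, such as KoboldCpp-style APIs, the settings above translate roughly to the request below; the endpoint and field names follow the Kobold generate API as I understand it and are assumptions about your setup:

```python
import requests

# Field names follow the KoboldCpp-style /api/v1/generate schema; adjust for your backend.
payload = {
    "prompt": (
        "<|im_start|>user\n"
        "Write a short story about a lighthouse keeper.<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
    "max_length": 400,
    "temperature": 0.8,
    "min_p": 0.05,
    "top_a": 0.3,
    "rep_pen": 1.03,
    "stop_sequence": ["<|im_end|>"],
}
resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
print(resp.json()["results"][0]["text"])
```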

**Training config**

<details><summary>See Axolotl config</summary>

axolotl version: `0.4.1`
```yaml
# Model
base_model: arcee-ai/SuperNova-Medius
strict: false

# Liger Kernels (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

# Output and HuggingFace
output_dir: /home/kearm/axolotl/TQ-2.5-14B-Sugarquill
hub_model_id: allura-org/TQ-2.5-14B-Sugarquill-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: huggingface
wandb_entity:
wandb_name: TQ-2.5-14B-Sugarquill-1

# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: allura-org/sugarquill-10k
    type: completion

## Evaluation
val_set_size: 0.01
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 8192
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
special_tokens:

# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: false

# LoRA
peft_use_rslora: true
peft_use_dora: false # better but slower
adapter: lora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 32
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
#  - embed_tokens
#  - lm_head

#loraplus_lr_ratio: 8 # converges faster, but can make the model unstable
#loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 2

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0

## Learning Rate
warmup_ratio: 0.05
learning_rate: 0.00003
lr_scheduler: cosine
#lr_scheduler_kwargs:
#  min_lr: 0.0000024
optimizer: paged_ademamix_8bit # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 8 # Larger effective batch size usually means a stabler train; a larger micro-batch size also speeds it up.
micro_batch_size: 1 # Effective batch size per GPU = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing: "unsloth"
gradient_checkpointing_kwargs:
  use_reentrant: true
local_rank:
deepspeed: /home/kearm/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all
# fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: true
#   fsdp_offload_params: true
#   fsdp_use_orig_params: false
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
#   fsdp_state_dict_type: FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD

# Misc
early_stopping_patience:
debug:
```

</details>
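
The run above pushes only the LoRA adapter (`hub_model_id: allura-org/TQ-2.5-14B-Sugarquill-LoRA`). A minimal sketch of applying that adapter to the base model with `peft`, assuming the adapter repo is accessible:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then apply the Sugarquill LoRA on top.
base = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "allura-org/TQ-2.5-14B-Sugarquill-LoRA")
model = model.merge_and_unload()  # optional: bake the adapter in for faster inference

tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius")
```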