---
license: apache-2.0
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
language:
- en
base_model:
- arcee-ai/SuperNova-Medius
library_name: transformers
pipeline_tag: text-generation
---


# Qwen2.5-14B Sugarquill v1

A continued pretrain of SuperNova-Medius on assorted short story data from the web. SuperNova already had nice prose, but diversifying it a bit definitely doesn't hurt.
It's also, finally, a storywriter model with enough context for something longer than a short story, which is nice.

It's a fair bit more temperamental than Gemma, but can be tamed with some sampling.
Instruction following also stayed rather strong, so it works for both RP and storywriting, both in chat mode via back-and-forth co-writing and on raw completion.

Overall, I'd say it successfully transfers the essence of what I liked about Gemma Sugarquill. I will also make a Qwen version of Aletheia, but with a brand new LoRA, based on a brand new RP dataset that's in the making right now.


Model was trained by Auri.

---

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. I've also normalized punctuation to ASCII on the train split, so mismatched quote marks should not be an issue anymore. Whitespace was normalized too, so double spaces after a period should be gone as well.
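
The normalization described above might look roughly like the sketch below. The actual preprocessing script is not included in this card, so the character map and the `normalize_text` helper are assumptions made for illustration only.

```python
import re

# Hypothetical mapping: the exact set of characters that was normalized
# is not stated in this card, so this covers only the common cases.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
})


def normalize_text(text: str) -> str:
    """Map typographic punctuation to ASCII and collapse repeated spaces."""
    text = text.translate(PUNCT_MAP)
    # Collapse runs of spaces and tabs, but keep newlines (paragraph breaks).
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces left dangling at the end of a line.
    text = re.sub(r" +\n", "\n", text)
    return text.strip()


sample = "\u201cHi,\u201d she said.  \u2018Bye\u2019\u2026"
print(normalize_text(sample))
```

Keeping newlines intact matters for story data, since paragraph breaks carry structure that a blanket `\s+` replacement would destroy.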

It was trained on a 5x3090Ti workstation for 7.5 hours with rsLoRA. I switched back to Axolotl for this run, as LF just plain refused to run at all on this workstation. Also, it's a bf16 LoRA this time. Overall, training went much smoother than last time. I've attempted to train Qwen Sugarquill several times before, but the loss jumped like crazy. An effective batch size of 40, rsLoRA and the paged_ademamix_8bit optimizer seem to have completely solved this issue.
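
As a quick sanity check, the effective batch size and the rsLoRA scaling factor can be derived from the numbers in this card. The sketch assumes all five GPUs ran as data-parallel workers; `micro_batch_size`, `gradient_accumulation_steps`, `lora_r` and `lora_alpha` are taken from the Axolotl config below. The `alpha / sqrt(r)` scaling is the rank-stabilized LoRA rule, as opposed to the standard `alpha / r`.

```python
import math

# Values from this card: "5x3090Ti" and the Axolotl config.
num_gpus = 5  # assumption: every GPU acts as one data-parallel worker
micro_batch_size = 1
gradient_accumulation_steps = 8

effective_batch_size = num_gpus * micro_batch_size * gradient_accumulation_steps

lora_r = 64
lora_alpha = 32

# Standard LoRA scales the adapter update by alpha / r,
# rsLoRA scales it by alpha / sqrt(r) instead.
standard_scale = lora_alpha / lora_r
rslora_scale = lora_alpha / math.sqrt(lora_r)

print(effective_batch_size)  # 40
print(standard_scale)        # 0.5
print(rslora_scale)          # 4.0
```

At rank 64 the rsLoRA factor is eight times larger than the standard one, which is worth keeping in mind when reusing `lora_alpha` values from non-rsLoRA configs.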

Thanks to Kearm for providing compute for this training run!

**Format**

The model responds to ChatML instruct formatting, exactly like its base model.

```
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
```
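
For raw-completion backends, a prompt in the format above can be assembled by hand. This is a minimal sketch: `build_chatml_prompt` is a hypothetical helper, not part of any library, and in practice the tokenizer's own chat template (for example via `tokenizer.apply_chat_template` in transformers) is the safer route.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the response.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Write the opening line of a ghost story."}]
)
print(prompt)
```

Generation should stop on `<|im_end|>`, which closes the assistant turn.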

**Recommended Samplers**

I found this configuration to be quite stable:

```
Temperature - 0.8
Min-P - 0.05
Top-A - 0.3
Repetition Penalty - 1.03
```

Feel free to toy around with samplers after you get a feel for it. It seems to like Top-A and Smooth Sampling quite a bit.
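
To give a feel for what Min-P and Top-A do at these settings, here is a toy sketch in plain Python. It assumes the common definitions (Min-P keeps tokens with probability at least `min_p * p_max`, Top-A keeps tokens with probability at least `top_a * p_max ** 2`) and applies both cutoffs to the same distribution. Real backends differ in sampler order and details, and repetition penalty is left out entirely.

```python
import math


def softmax(logits, temperature=0.8):
    """Turn logits into probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(probs, min_p=0.05, top_a=0.3):
    """Zero out tokens below the Min-P and Top-A cutoffs, then renormalize."""
    p_max = max(probs)
    # Simplification: both cutoffs are computed from the same distribution.
    threshold = max(min_p * p_max, top_a * p_max ** 2)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


probs = [0.6, 0.3, 0.08, 0.02]
print(filter_probs(probs))  # the last two tokens fall below the Top-A cutoff
```

Both cutoffs scale with the top token's probability, so they prune hard when the model is confident and leave more candidates when the distribution is flat.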

**Training config**
<details><summary>See Axolotl config</summary>

axolotl version: `0.4.1`
```yaml
# Model
base_model: arcee-ai/SuperNova-Medius
strict: false

# Liger Kernels (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

# Output and HuggingFace
output_dir: /home/kearm/axolotl/TQ-2.5-14B-Sugarquill
hub_model_id: allura-org/TQ-2.5-14B-Sugarquill-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"

# WandB
wandb_project: huggingface
wandb_entity:
wandb_name: TQ-2.5-14B-Sugarquill-1

# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: allura-org/sugarquill-10k
    type: completion

## Evaluation
val_set_size: 0.01
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128

# Technical aspects
sequence_len: 8192
save_safetensors: true
saves_per_epoch: 2
logging_steps: 1
special_tokens:

# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: false

# LoRA
peft_use_rslora: true
peft_use_dora: false # better but slower
adapter: lora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 32
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
#  - embed_tokens
#  - lm_head

#loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:

# Training hyperparameters
# max_steps:
num_epochs: 2

# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0

## Learning Rate
warmup_ratio: 0.05
learning_rate: 0.00003
lr_scheduler: cosine
#lr_scheduler_kwargs:
#    min_lr: 0.0000024
optimizer: paged_ademamix_8bit # usually adamw_torch or paged_adamw_8bit

## Batch Size
gradient_accumulation_steps: 8      # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 1                 # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1

# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing: "unsloth"
gradient_checkpointing_kwargs:
   use_reentrant: true
local_rank:
deepspeed: /home/kearm/axolotl/deepspeed_configs/zero3_bf16.json # Only use with multi gpu # _bf16_cpuoffload_all
# fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: true
#   fsdp_offload_params: true
#   fsdp_use_orig_params: false
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
#   fsdp_state_dict_type: FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD
# Misc
early_stopping_patience:
debug:
```

</details>