---
license: gemma
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
base_model:
- UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
pipeline_tag: text-generation
library_name: transformers
---

# Gemma-2-9B Sugarquill v0

An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web.
I was trying to diversify Gemma's prose without completely destroying its smarts. I think I half-succeeded? This model could have used another epoch of training, but even as is, it's already more creative and descriptive than its base model, without becoming too silly. It doesn't seem to have degraded much in terms of core abilities, either.
Should be usable both for RP and raw completion storywriting.
I originally planned to use this in a merge, but I feel like this model is interesting enough to be released on its own as well.

Model was trained by Auri.

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on an 8xH100 SXM node for 30 minutes with rsLoRA.
I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. This seems to be directly related to Gemma, as the same training setup worked flawlessly for Qwen.
Thanks to Kearm for helping with setting up LLaMA-Factory (LF) on that node, and to Featherless for providing it for EVA-Qwen2.5 (and this model, unknowingly lol) training.
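
For readers wondering what rsLoRA changes in practice: rank-stabilized LoRA scales the adapter update by `alpha / sqrt(rank)` instead of the usual `alpha / rank`. A minimal sketch of that arithmetic, using the `lora_rank` and `lora_alpha` values from the config further down (the helper name `lora_scale` is made up for illustration, it is not a LLaMA-Factory or PEFT function):

```python
import math


def lora_scale(alpha: float, rank: int, rslora: bool) -> float:
    """Multiplier applied to the low-rank update B @ A (hypothetical helper)."""
    if rslora:
        # Rank-stabilized LoRA: alpha / sqrt(rank)
        return alpha / math.sqrt(rank)
    # Standard LoRA: alpha / rank
    return alpha / rank


rank, alpha = 64, 32  # lora_rank and lora_alpha from the training config

standard = lora_scale(alpha, rank, rslora=False)
rank_stabilized = lora_scale(alpha, rank, rslora=True)

print(f"standard LoRA scale: {standard}")
print(f"rsLoRA scale:        {rank_stabilized}")
```

So at rank 64 the same alpha gives a noticeably larger effective update with rsLoRA than it would with plain LoRA, which is worth keeping in mind when reusing these hyperparameters without `use_rslora`.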

**Format**

The model responds to Gemma instruct formatting, exactly like its base model.

```
  <bos>
  <start_of_turn>user{user message}<end_of_turn>
  <start_of_turn>model{response}<end_of_turn>
  <eos>
```
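
If you are building prompts by hand rather than through a chat template, something like the sketch below should work. It assumes the standard Gemma 2 chat template, which puts a newline after the role name and after each `<end_of_turn>`; the function name `build_gemma_prompt` and the example messages are made up for illustration.

```python
def build_gemma_prompt(turns, add_generation_prompt=True):
    """Assemble a Gemma-style prompt string.

    turns: list of (role, text) pairs, where role is "user" or "model".
    Assumes the standard Gemma 2 template layout (newline after the role
    name and after each <end_of_turn>).
    """
    parts = ["<bos>"]
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    if add_generation_prompt:
        # Leave the model turn open so generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([("user", "Write a short story about a lighthouse.")])
print(prompt)
```

If your tokenizer already prepends `<bos>` on its own, drop the leading `<bos>` from the string so it is not doubled.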

**Training config**
<details><summary>See LLaMA-Factory config</summary>
  
```yaml
### Model
model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
#ref_model: # Reference model for RL (optional, for everything besides SimPO, which doesn't take it at all)
#ref_model_quantization_bit: 8 # 8 or 4

### Method
stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
do_train: true
finetuning_type: lora # full, freeze or lora
lora_target: all
#pref_beta: 0.1
#pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge

### Reward model
#reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
#reward_model_type: full # full, lora, api
#reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
#reward_model_quantization_bit: 8 # 4 or 8

### Freeze
#freeze_trainable_layers: # The number of trainable layers for freeze (partial-parameter) fine-tuning. Positive number means n last layers to train, negative - n first layers to train
#freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
#freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate

### LoRA
#loraplus_lr_ratio: 8.0
#loraplus_lr_embedding:
use_dora: false
use_rslora: true
lora_rank: 64 # 64 is optimal for most trains on instruct, if training on base - use rslora or dora
lora_alpha: 32
lora_dropout: 0.05
#pissa_init: true
#pissa_iter: 16
#pissa_convert: true

### QLoRA
quantization_bit: 8 # 2,3,4,5,6,8 in HQQ, 4 or 8 in bnb
quantization_method: hqq # bitsandbytes or hqq

### DeepSpeed
deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json which is required for HQQ on multigpu

### Dataset
dataset: sugarquill-10k # define in data/dataset_info.json
cutoff_len: 8192
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
#template: chatml

### Output
output_dir: saves/gemma/lora/sugarquill-1
logging_steps: 3
save_steps: 50
plot_loss: true
compute_accuracy: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 1 # Effective b/s == per-device b/s * grad accum steps * number of GPUs
gradient_accumulation_steps: 8
learning_rate: 3.0e-5
optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
num_train_epochs: 2.0
lr_scheduler_type: cosine # cosine, constant or linear
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
packing: true
max_grad_norm: 1.0

### Opts
flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fallback to eager
enable_liger_kernel: true # Pretty much must have if it works
#use_unsloth: true # May not work with multigpu idk
#use_adam_mini: true # Comment optim if using this

### Eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.05

### Misc
include_num_input_tokens_seen: true
ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
upcast_layernorm: true

### Inference for PPO
#max_new_tokens: 512
#temperature: 0.8
#top_k: 0
#top_p: 0.8

### Tracking
report_to: wandb # or tensorboard or mlflow | LOGIN BEFORE STARTING TRAIN OR ELSE IT WILL CRASH
run_name: G2-9B-Sugarquill-1

### Merge Adapter
#export_dir: models/G2-9B-Sugarquill
#export_size: 4
#export_device: gpu
#export_legacy_format: false

```

</details>
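
As a quick sanity check on the batch settings above, here is the effective batch size worked out from the formula in the config comment. The GPU count comes from the 8xH100 node mentioned in the training notes; the tokens-per-step figure is only an upper bound that assumes packing fills every sequence up to `cutoff_len`.

```python
# Values copied from the training config and notes above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 8        # 8xH100 node
cutoff_len = 8192   # max packed sequence length

# Effective b/s == per-device b/s * grad accum steps * number of GPUs
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Upper bound: assumes every packed sequence is exactly cutoff_len tokens.
max_tokens_per_step = effective_batch_size * cutoff_len

print(f"effective batch size: {effective_batch_size} sequences per optimizer step")
print(f"max tokens per step:  {max_tokens_per_step:,}")
```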