---
license: gemma
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
base_model:
- UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
pipeline_tag: text-generation
library_name: transformers
---

<img src="image_27.png" alt="A beautiful witch writing a book with a quill">
<sub>Image by CalamitousFelicitouness</sub>

---

# Gemma-2-9B Sugarquill v0

An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web.
I was trying to diversify Gemma's prose without completely destroying its smarts, and I think I half-succeeded. This model could have used another epoch of training, but even as-is it's already more creative and descriptive than its base model without becoming too silly, and it doesn't seem to have degraded much in core abilities either.
Should be usable both for RP and raw completion storywriting.
I originally planned to use this in a merge, but I feel this model is interesting enough to be released on its own as well.

Model was trained by Auri.

Dedicated to Cahvay, who has wanted a Gemma finetune from me for months, and to La Rata, who loves storywriter models.

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. The run took 30 minutes on an 8xH100 SXM node with rsLoRA.
I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. It seems to be directly related to Gemma, as the same training setup worked flawlessly for Qwen.
Thanks to Kearm for helping set up LLaMA-Factory on that node, and to Featherless for providing it for EVA-Qwen2.5 training (and, unknowingly, for this model's as well).
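
For a sense of scale, here's a rough back-of-the-envelope on the run length, using the batch settings from the config below. The numbers are approximate, since packing efficiency and the 10% val split shift them a little:

```python
# Rough step-count estimate from the reported run numbers (approximate).
tokens = 18_700_000              # ~18.7M tokens across the 10k rows
cutoff_len = 8192                # sequence length, with packing enabled
per_device_bs, grad_accum, gpus = 1, 8, 8

packed_seqs = tokens // cutoff_len                # ~2282 packed sequences
effective_bs = per_device_bs * grad_accum * gpus  # 1 * 8 * 8 = 64
steps_per_epoch = packed_seqs // effective_bs     # ~35
print(f"~{2 * steps_per_epoch} optimizer steps over 2 epochs")  # ~70
```

By that count, the wandb cutoff at step 13 means most of the run went unlogged.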

**Format**

The model responds to Gemma instruct formatting, exactly like its base model.

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn>
<eos>
```
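
For instruct use, `apply_chat_template` reproduces the format above. A minimal transformers inference sketch (untested; the repo id is a placeholder, not this model's actual location):

```python
# Minimal inference sketch (untested). The repo id is a placeholder; swap in
# wherever this model is actually hosted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Gemma-2-9B-Sugarquill-v0"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# apply_chat_template renders the <start_of_turn>/<end_of_turn> format above
# and appends the generation prompt for the model turn.
messages = [{"role": "user", "content": "Write a short story about a sea witch."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For raw completion storywriting, skip the template and feed plain text instead.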

**Training config**
<details><summary>See LLaMA-Factory config</summary>

```yaml
### Model
model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
#ref_model: # Reference model for RL (optional; used by everything besides SimPO, which doesn't take one at all)
#ref_model_quantization_bit: 8 # 8 or 4

### Method
stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
do_train: true
finetuning_type: lora # full, freeze or lora
lora_target: all
#pref_beta: 0.1
#pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge

### Reward model
#reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
#reward_model_type: full # full, lora, api
#reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
#reward_model_quantization_bit: 8 # 4 or 8

### Freeze
#freeze_trainable_layers: # Number of trainable layers for freeze (partial-parameter) fine-tuning. Positive means the last n layers, negative the first n layers
#freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
#freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate

### LoRA
#loraplus_lr_ratio: 8.0
#loraplus_lr_embedding:
use_dora: false
use_rslora: true
lora_rank: 64 # 64 is optimal for most trains on instruct; if training on base, use rslora or dora
lora_alpha: 32
lora_dropout: 0.05
#pissa_init: true
#pissa_iter: 16
#pissa_convert: true

### QLoRA
quantization_bit: 8 # 2,3,4,5,6,8 with HQQ; 4 or 8 with bnb
quantization_method: hqq # bitsandbytes or hqq

### DeepSpeed
deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json; the latter is required for HQQ on multi-GPU

### Dataset
dataset: sugarquill-10k # define in data/dataset_info.json
cutoff_len: 8192
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
#template: chatml

### Output
output_dir: saves/gemma/lora/sugarquill-1
logging_steps: 3
save_steps: 50
plot_loss: true
compute_accuracy: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 1 # Effective batch size == per-device b/s * grad accum steps * number of GPUs
gradient_accumulation_steps: 8
learning_rate: 3.0e-5
optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
num_train_epochs: 2.0
lr_scheduler_type: cosine # cosine, constant or linear
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
packing: true
max_grad_norm: 1.0

### Opts
flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fall back to eager
enable_liger_kernel: true # Pretty much a must-have if it works
#use_unsloth: true # May not work with multi-GPU
#use_adam_mini: true # Comment out optim if using this

### Eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.05

### Misc
include_num_input_tokens_seen: true
ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
upcast_layernorm: true

### Inference for PPO
#max_new_tokens: 512
#temperature: 0.8
#top_k: 0
#top_p: 0.8

### Tracking
report_to: wandb # or tensorboard or mlflow | LOG IN BEFORE STARTING THE TRAIN OR ELSE IT WILL CRASH
run_name: G2-9B-Sugarquill-1

### Merge Adapter
#export_dir: models/G2-9B-Sugarquill
#export_size: 4
#export_device: gpu
#export_legacy_format: false
```
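
The `dataset` key refers to an entry in LLaMA-Factory's `data/dataset_info.json`. As a sketch of what registering it might look like, assuming the 10k rows were exported to a hypothetical `data/sugarquill_10k.json` with a `text` field:

```python
# Sketch of the data/dataset_info.json entry behind `dataset: sugarquill-10k`.
# File name and layout are assumptions; for `stage: pt`, LLaMA-Factory reads
# raw text through the "prompt" column mapping.
import json

entry = {
    "sugarquill-10k": {
        "file_name": "sugarquill_10k.json",
        "columns": {"prompt": "text"},
    }
}

# Merge into the existing registry instead of overwriting it.
with open("data/dataset_info.json") as f:
    registry = json.load(f)
registry.update(entry)
with open("data/dataset_info.json", "w") as f:
    json.dump(registry, f, indent=2)
```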

</details>
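
Training was launched through LLaMA-Factory (e.g. `llamafactory-cli train <config>.yaml`). Since the export section above is commented out, merging the adapter by hand with peft would look roughly like this, with paths taken from the config (a sketch, untested):

```python
# Sketch: merge the rsLoRA adapter into the base model with peft, mirroring
# the commented-out "Merge Adapter" section above. Note the adapter was
# trained against an HQQ 8-bit base, so merging into bf16 weights is a close
# approximation rather than an exact round-trip.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "saves/gemma/lora/sugarquill-1").merge_and_unload()
merged.save_pretrained("models/G2-9B-Sugarquill")

# Ship the tokenizer alongside the merged weights.
AutoTokenizer.from_pretrained("UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3").save_pretrained(
    "models/G2-9B-Sugarquill"
)
```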